Abstract


Many statistical learning problems have recently been shown to be amenable to Semi-Definite
Programming (SDP), with community detection and clustering in Gaussian mixture
models as the most striking instances Javanmard et al. (2016). Given the growing
range of applications of SDP-based techniques to machine learning problems, and the rapid 
progress in the design of efficient algorithms for solving SDPs, an intriguing question is to 
understand how the recent advances from empirical process theory and Statistical Learning 
Theory can be leveraged for providing a precise statistical analysis of SDP estimators. 

In the present paper, we borrow cutting-edge techniques and concepts from the Learning
Theory literature, such as fixed point equations and excess risk curvature arguments, which 
yield general estimation and prediction results for a wide class of SDP estimators. From 
this perspective, we revisit some classical results in community detection from Guédon and 
Vershynin (2016) and Fei and Chen (2019b), and we obtain statistical guarantees for SDP 
estimators used in signed clustering, angular group synchronization (for both multiplica- 
tive and additive models) and MAX-CUT. Our theoretical findings are complemented by 
numerical experiments for each of the three problems considered, showcasing the competi- 
tiveness of the SDP estimators. 

Keywords: Semi-Definite Programming, Statistical Learning, Group Synchronization, 
Signed Clustering. 


1. Introduction 


Many statistical learning problems have recently been shown to be amenable to Semi- 
Definite Programming (SDP), with community detection and clustering in Gaussian mixture 
models as the most striking instances where SDP performs significantly better than other 
current approaches Javanmard et al. (2016). SDP is a class of convex optimization problems 
generalising linear programming to linear problems over semi-definite matrices Todd (2001),
Wolkowicz et al. (2012), Boyd and Vandenberghe (2004), and it has proved to be an
important tool in the computational approach to difficult challenges in automatic control,
combinatorial optimization, polynomial optimization, data mining, high-dimensional
statistics and the numerical solution of partial differential equations. The goal of the present
paper is to introduce a new fixed point approach to the statistical analysis of SDP-based
estimators, and to illustrate our method on four current problems of interest, namely
community detection, signed clustering, angular group synchronization, and MAX-CUT. Our aim is
to show that all these estimators can be viewed as special instances of Empirical Risk
Minimization (ERM), and can therefore benefit from the very large literature on that subject. The
rest of this section provides historical background and presents the mathematical definition 
of SDP-based estimators. 


1.1 Historical background 


SDP is a class of optimization problems which includes linear programming as a partic- 
ular case, and can be written as the set of problems over symmetric (resp. Hermitian) 
positive semi-definite matrix variables, with linear cost function and affine constraints, i.e. 
optimization problems of the form 


$$\max_{Z \succeq 0} \big( \langle A, Z \rangle : \langle B_j, Z \rangle = b_j \text{ for } j = 1, \dots, m \big), \qquad (1)$$


where $A, B_1, \dots, B_m$ are given matrices. SDPs are convex programming problems which can
be solved in polynomial time when the constraint set is compact, and they play a paramount
role in a large number of convex and non-convex problems, for which they often appear as
a convex relaxation Anjos and Lasserre (2011). We will occasionally use the notation $S_{n,+}$
(resp. $S_{n,-}$) for the cone of positive (resp. negative) semi-definite matrices.


1.1.1 EARLY HISTORY 


Early use of Semi-Definite Programming in statistics can be traced back to Scobey and
Kabe (1978) and Fletcher (1981). Shortly afterwards, Shapiro used SDP in factor analysis
Shapiro (1982). The study of the mathematical properties of SDP then gained momentum
with the introduction of Linear Matrix Inequalities (LMI) and their numerous applications
in control theory, system identification and signal processing. The book Boyd et al. (1994)
is the standard reference for this type of result, mostly obtained in the 1990s.


1.1.2 THE GOEMANS-WILLIAMSON SDP RELAXATION OF MAX-CUT AND ITS LEGACY 


A notable turning point is the publication of Goemans and Williamson (1995), where SDP 
was shown to provide a 0.87 approximation to the NP-hard problem known as MAX-CUT.
The MAX-CUT problem is a clustering problem on graphs which consists in finding two
complementary subsets $S$ and $S^c$ of nodes such that the sum of the weights of the edges
between $S$ and $S^c$ is maximal. In Goemans and Williamson (1995), the authors approach this
difficult combinatorial problem by using what is now known as the Goemans-Williamson
SDP relaxation, and use the Cholesky factorization of the optimal solution to this SDP in
order to produce a randomized scheme achieving the 0.87 bound in expectation. Moreover,
this problem can be seen as a first instance where the Laplacian of a graph is employed in
order to provide an optimal bi-clustering in a graph, and certainly represents the first
chapter of a long and fruitful relationship between clustering, embedding and graph Laplacians.
Other SDP schemes for approximating hard combinatorial problems are, to name a few, for
the graph coloring problem Karger et al. (1998), and the satisfiability problem Goemans 
and Williamson (1995, 1994). These results were later surveyed in Lemaréchal et al. (1995); 
Goemans (1997) and Wolkowicz (1999). The randomized scheme introduced by Goemans 
and Williamson was then further improved in order to study more general Quadratically 
Constrained Quadratic Programmes (QCQP) in various references, most notably Nesterov 
(1997); Zhang (2000) and further extended in He et al. (2008). Many applications to signal 
processing are discussed in Olsson et al. (2007), Ma (2010); one specific reduced complexity 
implementation in the form of an eigenvalue minimization problem and its application to 
binary least-squares recovery and denoising is presented in Chrétien and Corset (2009). 


1.1.3 RELAXATION OF MACHINE LEARNING AND HIGH-DIMENSIONAL STATISTICAL 
ESTIMATION PROBLEMS 


Applications of SDP to problems related to machine learning are more recent and probably
started with the SDP relaxation of k-means in Peng and Xia (2005); Peng and Wei (2007)
and later in Ames (2014). This approach was then further improved using a refined
statistical analysis by Royer (2017) and Giraud and Verzelen. Similar methods have also been
applied to community detection Hajek et al. (2016); Abbe et al. (2015) and, for the weak
recovery viewpoint, Guédon and Vershynin (2016). This last approach was also re-used via
the kernel trick for point cloud clustering Chrétien et al. (to appear). Another
incarnation of SDP in machine learning is the extensive use of nuclear norm-penalized least-squares
costs as a surrogate for rank-penalization in low-rank recovery problems such as matrix
completion in recommender systems, matrix compressed sensing, natural language process- 
ing and quantum state tomography; these topics are surveyed in Davenport and Romberg 
(2016). 


The problem of manifold learning was also addressed using SDP, which is often mentioned
as one of the most accurate approaches to the problem, leaving aside its computational
complexity; see Weinberger et al. (2005); Weinberger and Saul (2006b,a); Hegde et al. (2012).
Connections with the design of fast converging Markov-Chains were also exhibited in Sun 
et al. (2006). Positive semi-definite embeddings for dimensionality reduction and manifold 
learning, along with out-of-sample extensions, were recently explored in Fanuel et al. (2017). 


In a different direction, A. Singer and collaborators have recently promoted the use of 
SDP relaxation for estimation under group invariance, an active area with many applications 
Singer (2011); Bandeira et al. (2014). SDP-based relaxations have also been considered in 
Cucuringu (2015) in the context of synchronization over $\mathbb{Z}_2$ in signed multiplex networks
with constraints, and Cucuringu (2016) in the setting of ranking from inconsistent and 
incomplete pairwise comparisons where an SDP-based relaxation of angular synchronization 
over SO(2) outperformed a suite of state-of-the-art algorithms from the ranking literature. 
Phase recovery using SDP was studied in e.g. Waldspurger et al. (2015) and Demanet and 
Hand (2014). An extension to multi-partite clustering based on SDP was then proposed 
in Karger et al. (1998). Other important applications of SDP include information theory
Lovász (1979), estimation in power networks Lavaei and Low (2011), quantum tomography
Mazziotti (2011), Gross et al. (2010) and polynomial optimization via sums-of-squares
relaxations Lasserre (2015); Blekherman et al. (2012). Sums-of-squares relaxations were
recently applied to statistical problems in De Castro et al. (2019); Hopkins (2018); de Castro
et al. (2017). The extension to the field of complex numbers, with $\langle \cdot, \cdot \rangle$ denoting the Hermitian
inner product, has been less extensively studied but has many interesting applications and
comes with efficient algorithms Goemans and Williamson (1995); Gilbert and Josz (2017).


1.2 Mathematical formulation of the problem 


The general problem we study can be stated as follows. Let $A$ be a random matrix in
$\mathbb{R}^{n \times n}$ and $\mathcal{C} \subset \mathbb{R}^{n \times n}$ be a constraint set. The object that we want to recover, for instance, the
community membership vector in community detection, is related to an oracle defined as

$$Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A, Z \rangle, \qquad (2)$$

where $\langle A, B \rangle = \mathrm{Tr}(A \bar{B}^\top) = \sum_{i,j} A_{ij} \bar{B}_{ij}$ when $A, B \in \mathbb{C}^{n \times n}$, and where $\bar{z}$ is the conjugate of $z \in \mathbb{C}$.
We would like to estimate $Z^*$, from which we can ultimately retrieve the object that really
matters to us (for instance, by considering a singular vector associated to the largest singular
value of $Z^*$). To that end, we consider the following natural estimator of $Z^*$ given by

$$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A, Z \rangle, \qquad (3)$$

which is simply obtained by replacing the unobserved quantity $\mathbb{E}A$ by the observation $A$.
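To make the pair oracle/estimator in (2) and (3) concrete, the following minimal sketch compares them on a toy problem. Everything specific here is an assumption made for illustration only: the mean matrix is taken to be a multiple of a rank-one sign matrix, the noise is symmetric Gaussian, and the constraint set is replaced by a small finite family of rank-one sign matrices so that both argmax problems can be solved by enumeration (this finite family is not an SDP constraint set and is not star-shaped).

```python
import itertools
import random

def inner(A, Z):
    """Real inner product <A, Z> = sum_ij A_ij * Z_ij."""
    return sum(a * z for row_a, row_z in zip(A, Z) for a, z in zip(row_a, row_z))

def outer(x):
    """Rank-one matrix x x^T as a list of rows."""
    return [[xi * xj for xj in x] for xi in x]

random.seed(0)
n, signal, sigma = 5, 1.0, 0.3      # hypothetical toy parameters
x_star = (1, 1, 1, -1, -1)          # hidden sign vector

# Unobserved mean matrix EA and a noisy symmetric observation A.
EA = [[signal * z for z in row] for row in outer(x_star)]
A = [row[:] for row in EA]
for i in range(n):
    for j in range(i, n):
        w = random.gauss(0.0, sigma)
        A[i][j] += w
        if j != i:
            A[j][i] += w

# Toy constraint set: all x x^T with x in {-1, 1}^n and first coordinate +1
# (fixing the first sign removes the x / -x ambiguity).
C = [outer((1,) + x) for x in itertools.product((-1, 1), repeat=n - 1)]

Z_star = max(C, key=lambda Z: inner(EA, Z))   # oracle, as in (2)
Z_hat = max(C, key=lambda Z: inner(A, Z))     # estimator, as in (3)
excess_risk = inner(EA, Z_star) - inner(EA, Z_hat)
print("excess risk <EA, Z* - Z_hat> =", excess_risk)
```

The quantity printed at the end is the excess risk of the estimator, which is nonnegative by definition of the oracle.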

As pointed out, in many situations, $Z^*$ is not the object we want to estimate, but
there is a straightforward relation between $Z^*$ and this object. For instance, consider the
community detection problem, where the goal is to recover the community membership vector
$x^* \in \{-1, 1\}^n$ of $n$ nodes. Here, when $\mathcal{C}$ is well chosen, there is a close relation between
$Z^*$ and $x^*$, given by $Z^* = x^* (x^*)^\top$. We therefore need a final step to estimate $x^*$ from $\hat{Z}$,
for instance, by letting $\hat{x}$ denote a top eigenvector of $\hat{Z}$, and then using the Davis-Kahan
"sin($\Theta$)" Theorem Davis and Kahan (1970); Yu et al. (2015) to control the estimation of
$x^*$ by $\hat{x}$ from that of $Z^*$ by $\hat{Z}$.
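The final rounding step can be sketched as follows. This is only an illustration under stated assumptions: the matrix standing in for the estimator is a hypothetical perturbation $0.9\, x^* (x^*)^\top + 0.1\, I$ of the oracle, the top eigenvector is approximated by plain power iteration, and the global sign is fixed by convention so that the first coordinate is positive.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_eigenvector(M, iters=100, seed=0):
    """Power iteration for a symmetric PSD matrix (random start, fixed seed)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in M]
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v

def round_to_signs(v):
    """Entrywise sign rounding, with the global sign chosen so that the first entry is +1."""
    s = 1 if v[0] >= 0 else -1
    return [s * (1 if vi >= 0 else -1) for vi in v]

x_star = [1, 1, -1, 1, -1, -1]
n = len(x_star)
# Hypothetical stand-in for the SDP solution: close to x* x*^T, PSD, unit diagonal.
Z_hat = [[0.9 * x_star[i] * x_star[j] + (0.1 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

x_hat = round_to_signs(top_eigenvector(Z_hat))
print("recovered labels:", x_hat)
```

In this idealized case the top eigenvector of the stand-in matrix is proportional to $x^*$, so sign rounding returns the hidden vector; with a genuinely noisy estimator the Davis-Kahan argument quantifies how far the eigenvector can move.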

When the constraint $\mathcal{C}$ is of the form $\mathcal{C} = \{Z \in \mathbb{R}^{n \times n} : Z \succeq 0, \langle Z, B_j \rangle = b_j, j = 1, \dots, m\}$,
where $B_1, \dots, B_m \in \mathbb{R}^{n \times n}$ and $Z \succeq 0$ is notation for "$Z$ is positive semidefinite", then (3)
is a semidefinite program (SDP) Boyd and Vandenberghe (2004).


Goal of the paper. The aim of the present work is to present a general approach to the 
study of the statistical properties of SDP-based estimators defined in (3). In particular, 
using our framework, one is able to obtain new (non-asymptotic) rates of convergence or 
exact reconstruction properties for a wide class of estimators obtained as a solution of a 
semidefinite program like (3). Specifically, our goal is to show that the solution to (3) 
can be analyzed in a statistical way, when EA is only partially and noisily observed in A. 
Even though the constraint C may not necessarily be the intersection of the set of PSD (or 
Hermitian) matrices with linear spaces — such as in the definition of SDP — in the following, 
a solution $\hat{Z}$ of (3) will be called an SDP estimator because, in all our examples, $\hat{Z}$ will be
the solution of an SDP. For the sake of generality, however, we will only assume a minimal requirement
on the shape of C. We also illustrate our results on a number of specific machine learning 
problems, such as various forms of clustering problems and angular group synchronization. 
Three out of the four examples worked out here are concerned with real-valued matrices. 
Only the angular synchronization problem is approached using complex matrices.

Our goal here is to show that various problems can be analyzed in the same way. Our
point of view is to look at $Z^*$ as an oracle, and $\hat{Z}$ as an ERM, and therefore to analyze the
problem from this perspective; in particular, we can benefit from the extensive literature 
on the statistical properties of ERMs. Our approach reveals a general methodology and its 
two crucial points, which are the local curvature of the excess risk and the computation of 
a complexity fixed point. 


2. Main general results for the statistical analysis of SDP estimators 


From a statistical point of view, the task remains to estimate in the most efficient way the
oracle $Z^*$, and to that end $\hat{Z}$ is our candidate estimator. The point of view we will use to
evaluate how far $\hat{Z}$ is from $Z^*$ comes from the Learning Theory literature. We therefore
see $\hat{Z}$ as an empirical risk minimization (ERM) procedure built on a single observation
$A$, where the loss function is the linear one $Z \in \mathcal{C} \mapsto \ell_Z(A) = -\langle A, Z \rangle$, and the oracle
$Z^*$ is indeed the one minimizing the risk function $Z \in \mathcal{C} \mapsto \mathbb{E}\ell_Z(A)$ over $\mathcal{C}$. Having this
setup in mind, we can use all the machinery developed in Learning Theory (see for instance
Vapnik (1998); Koltchinskii (2006); Massart (2007); van de Geer (2000)) to obtain rates of
convergence for the ERM (here $\hat{Z}$) toward the oracle (here $Z^*$).

There is one key quantity driving the rate of convergence of the ERM: a fixed point
complexity parameter. This type of parameter carries all the statistical complexity of the
problem, and even though it is usually easy to set up, its computation can be tedious since
it requires controlling, with large probability, the supremum of empirical processes indexed
by "localized classes". We now define the complexity fixed point related to the problem we
are considering here.


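Definition 1 Let $0 < \Delta < 1$. The fixed point complexity parameter at deviation $1 - \Delta$ is

$$r^*(\Delta) = \inf\Big( r > 0 : \mathbb{P}\Big[ \sup_{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) r \Big] \ge 1 - \Delta \Big). \qquad (4)$$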

Fixed point complexity parameters have been extensively used in Learning Theory since the
introduction of the localization argument Massart (2007); Koltchinskii (2011); van de Geer
(2000); Birgé and Massart (1993). When they can be computed, they are preferred to the
(global) analysis developed by Chervonenkis and Vapnik Vapnik (1998) to study ERM, since
the latter analysis always yields slower rates given that the Vapnik-Chervonenkis bound is
a high-probability bound on the non-localized empirical process $\sup_{Z \in \mathcal{C}} \langle A - \mathbb{E}A, Z - Z^* \rangle$,
which is an upper bound for $r^*(\Delta)$ since $\{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r\} \subset \mathcal{C}$. The gap between
the global and local analyses can be important since fast rates cannot be obtained using
the VC approach, whereas the localization argument resulting in fixed points such as the
one in Definition 1 may yield fast rates of convergence or even exact recovery results.
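The gap between the local and global analyses can be observed numerically on a toy model. The sketch below is not the authors' procedure; it is a Monte Carlo approximation under explicit assumptions: the mean matrix is a multiple of $x^* (x^*)^\top$, the noise is symmetric Gaussian, and the constraint set is the union of the segments joining $Z^*$ to finitely many rank-one sign matrices (a set that is star-shaped in $Z^*$ by construction). On such a set the localized supremum has a closed form, since on each segment the excess risk and the noise term both scale linearly with the step size. The infimum over $r$ is approximated on a grid.

```python
import itertools
import random

def quad(W, x):
    """<W, x x^T> = x^T W x."""
    return sum(W[i][j] * x[i] * x[j] for i in range(len(x)) for j in range(len(x)))

def sym_noise(n, sigma, rng):
    """Symmetric matrix with independent N(0, sigma^2) entries on and above the diagonal."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            W[i][j] = W[j][i] = rng.gauss(0.0, sigma)
    return W

def smallest_r(samples, grid, delta, sup_fn):
    """Smallest r on the grid with empirical P[sup_fn(r) <= r / 2] >= 1 - delta."""
    for r in grid:
        hits = sum(1 for gains in samples if sup_fn(gains, r) <= r / 2)
        if hits >= (1 - delta) * len(samples):
            return r
    return float("inf")

rng = random.Random(0)
n, signal, sigma, delta = 4, 1.0, 0.3, 0.1   # hypothetical toy parameters
x_star = (1, 1, -1, -1)
others = [(1,) + x for x in itertools.product((-1, 1), repeat=n - 1)
          if (1,) + x != x_star]

# Excess risk of each extreme point: <EA, Z* - x x^T> = signal * (n^2 - (x . x*)^2).
excess = [signal * (n * n - sum(a * b for a, b in zip(x, x_star)) ** 2) for x in others]

# One row of noise terms <A - EA, x x^T - Z*> per Monte Carlo draw.
samples = []
for _ in range(1000):
    W = sym_noise(n, sigma, rng)
    base = quad(W, x_star)
    samples.append([quad(W, x) - base for x in others])

def local_sup(gains, r):
    # On the segment [Z*, x x^T], the constraint "excess risk <= r" caps the step at min(1, r / e).
    return max(max(g, 0.0) * min(1.0, r / e) for g, e in zip(gains, excess))

def global_sup(gains, r):
    # Non-localized supremum over the whole set; r is ignored.
    return max(max(g, 0.0) for g in gains)

grid = [0.1 * k for k in range(1, 401)]
r_local = smallest_r(samples, grid, delta, local_sup)
r_global = smallest_r(samples, grid, delta, global_sup)
print("local fixed point ~", r_local, " global bound ~", r_global)
```

Because the localized supremum never exceeds the global one, the local value returned here can never be larger than the global one, whatever the seed or the parameters.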

An example of a Vapnik-Chervonenkis type of analysis of SDP estimators can be found
in Guédon and Vershynin (2016) for the community detection problem. An improvement
of the latter approach has been obtained in Fei and Chen (2019b) thanks to a localization
argument, even though it is not stated in these words (we elaborate more on the two
approaches from Guédon and Vershynin (2016) and Fei and Chen (2019b) in Section 3).
A fixed point such as (4) is a sharp way to measure the statistical performance
of ERM estimators, and in particular of the SDP estimators that we are considering here.
Such fixed points can even be proved to be optimal (in a minimax sense) when the noise $A - \mathbb{E}A$ is
Gaussian Lecué and Mendelson (2013), and under mild conditions on the complexity of $\mathcal{C}$.

Before stating our general result, we first recall a definition of a minimal structural 
assumption on the constraint C. 














Definition 2 We say that the set $\mathcal{C}$ is star-shaped in $Z^*$ when for all $Z \in \mathcal{C}$, the segment
$[Z, Z^*]$ is in $\mathcal{C}$.


This is a mild assumption satisfied, for instance, when $\mathcal{C}$ is convex, which is the setup
we will always encounter in practical applications, given that SDP estimators are usually 
introduced after a “convex relaxation” argument. Our main general statistical bound on 
SDP estimators is as follows. 


Theorem 1 We assume that the constraint $\mathcal{C}$ is star-shaped in $Z^*$. Then, for all $0 < \Delta < 1$,
with probability at least $1 - \Delta$, it holds true that $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le r^*(\Delta)$.














Theorem 1 applies to any type of setup where an oracle $Z^*$ is estimated by an estimator
$\hat{Z}$ such as in (3). Its result shows that $\hat{Z}$ is almost a maximizer of the true objective function
$Z \mapsto \langle \mathbb{E}A, Z \rangle$ over $\mathcal{C}$ up to $r^*(\Delta)$. In particular, when $r^*(\Delta) = 0$, $\hat{Z}$ is exactly a maximizer
such as $Z^*$ and, in that case, we can work with $\hat{Z}$ as if we were working with $Z^*$ without
any loss. Then, in this "exact reconstruction case", the information about $\mathbb{E}A$ contained in
$A$ is enough for inferring $Z^*$ exactly.
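The deterministic mechanism behind Theorem 1 can be checked by simulation: whenever the event inside the probability in (4) holds at some level $r$, the excess risk of the estimator is at most $r$. The sketch below verifies this implication on a toy model and is only an illustration. Its assumptions are hypothetical: a rank-one mean matrix, symmetric Gaussian noise, and a constraint set equal to the union of segments from $Z^*$ to finitely many rank-one sign matrices, which is star-shaped in $Z^*$.

```python
import itertools
import random

def quad(W, x):
    """<W, x x^T> = x^T W x."""
    return sum(W[i][j] * x[i] * x[j] for i in range(len(x)) for j in range(len(x)))

rng = random.Random(1)
n, signal, sigma = 4, 1.0, 1.0    # hypothetical; noisy enough that estimation errors occur
x_star = (1, 1, -1, -1)
others = [(1,) + x for x in itertools.product((-1, 1), repeat=n - 1)
          if (1,) + x != x_star]
excess = [signal * (n * n - sum(a * b for a, b in zip(x, x_star)) ** 2) for x in others]

checked = violations = 0
for _ in range(500):
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            W[i][j] = W[j][i] = rng.gauss(0.0, sigma)
    base = quad(W, x_star)
    gains = [quad(W, x) - base for x in others]      # <A - EA, Z - Z*> at each endpoint

    # ERM over the star-shaped set: a linear objective on a segment [Z*, x x^T] is
    # maximal at an endpoint, and <A, Z - Z*> = gain - excess at the endpoint x x^T.
    best = max(range(len(others)), key=lambda k: gains[k] - excess[k])
    risk_hat = excess[best] if gains[best] - excess[best] > 0 else 0.0

    for r in (1.0, 4.0, 8.0, 12.0, 16.0, 24.0):
        local_sup = max(max(g, 0.0) * min(1.0, r / e) for g, e in zip(gains, excess))
        if local_sup <= r / 2:        # the event inside the probability in (4)
            checked += 1
            violations += risk_hat > r
print("events checked:", checked, "violations:", violations)
```

The count of violations stays at zero because of star-shapedness: if the estimator had excess risk above $r$, rescaling it along its segment toward $Z^*$ would produce a point of the localized set whose noise term exceeds $r/2$.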

Theorem 1 may be applied in many different settings; in the following, we study four 
such instances. We will apply Theorem 1 (or one of its corollaries stated below) to several
popular problems in the networks and graph signal processing literatures, namely, commu- 
nity detection Fortunato (2010) (we will mostly revisit the results in Guédon and Vershynin 
(2016) and Fei and Chen (2019b) from our perspective), signed clustering Cucuringu et al. 
(a), group synchronization Singer (2011) and MAX-CUT. 

The proof of Theorem 1 is straightforward (mostly because the loss function is linear). 
Its importance stems from the fact that it puts forward two important concepts originally
introduced in Learning Theory, namely that the complexity of the problem comes from that
of the local subset $\mathcal{C} \cap \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r^*(\Delta)\}$, and that the "radius" $r^*(\Delta)$ of the
localization is the solution of a fixed point equation. For a setup given by a random matrix
$A$ and a constraint $\mathcal{C}$, we should try to understand how these two ideas apply to obtain
estimation properties of SDP estimators such as $\hat{Z}$, that is, to understand the shape of the
local subsets $\mathcal{C} \cap \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}, r > 0$, and the maximal oscillations of the empirical
process $Z \mapsto \langle A - \mathbb{E}A, Z - Z^* \rangle$ indexed by these local subsets. We will consider this task
in three distinct problem instances. For a detailed proof of Theorem 1 (and Theorem 2
below), we refer the reader to Appendix A.

The main conclusion of Theorem 1 is that all the information for the problem of
estimating $Z^*$ via $\hat{Z}$ is contained in the fixed point $r^*(\Delta)$. We therefore have to compute or
upper bound such a fixed point. This might be difficult in great generality but there are
some tools that can help to find upper bounds on $r^*(\Delta)$.

A first approach is to understand the shape of the local sets
$\mathcal{C} \cap \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}, r > 0$, and to that end, it is helpful to characterize the curvature of the excess risk
$Z \mapsto \langle \mathbb{E}A, Z^* - Z \rangle$ around its minimizer $Z^*$. This type of local characterization of the
excess risk is also a tool used in Learning Theory that goes back to classical conditions such
as the Margin assumption Tsybakov (2004); Mammen and Tsybakov (1999) or the Bernstein
condition Bartlett and Mendelson (2006). The latter condition was initially introduced as
an upper bound of the variance term by its expectation: for all $Z \in \mathcal{C}$,
$\mathbb{E}(\ell_Z(A) - \ell_{Z^*}(A))^2 \le c_0 \mathbb{E}(\ell_Z(A) - \ell_{Z^*}(A))$ for some absolute constant $c_0$, but it has now been better understood
as a way to discriminate the oracle from the other points in the model $\mathcal{C}$. These assumptions
were global assumptions in the sense that they concern all $Z$ in $\mathcal{C}$. It has been recently shown
Chinot et al. (2018) that only the local curvature of the excess risk needs to be understood.
We now introduce this tool in our setup.

We characterize the local curvature of the excess risk by some function $G : \mathbb{R}^{n \times n} \to \mathbb{R}$.
Most of the time, the $G$ function is a norm like the $\ell_1$-norm or a power of a norm, such
as the squared $\ell_2$-norm. The radius defining the local subset onto which we need to
understand the curvature of the excess risk is also the solution of a fixed point equation

$$r_G^*(\Delta) = \inf\Big( r > 0 : \mathbb{P}\Big[ \sup_{Z \in \mathcal{C} : G(Z^* - Z) \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) r \Big] \ge 1 - \Delta \Big). \qquad (5)$$


The difference between the two fixed points $r^*(\Delta)$ and $r_G^*(\Delta)$ is that the local subsets
are not defined using the same proximity function to the oracle $Z^*$; the first one uses the
excess risk as a proximity function, while the second one uses the $G$ function as a proximity
function. The $G$ function should play the role of a simple description of the curvature of
the excess risk function locally around $Z^*$; this is formalized in the next assumption.


























Assumption 1 For all $Z \in \mathcal{C}$, if $\langle \mathbb{E}A, Z^* - Z \rangle \le r_G^*(\Delta)$ then $\langle \mathbb{E}A, Z^* - Z \rangle \ge G(Z^* - Z)$.


Typical examples of curvature functions $G$ will have the form $G(Z^* - Z) = \theta \|Z^* - Z\|^\kappa$ for
some $\kappa \ge 1$, $\theta > 0$ and some norm $\|\cdot\|$. In that case, the parameter $\kappa$ was initially called
the margin parameter Tsybakov (2003); Mammen and Tsybakov (1999). Even though
the relation given in Assumption 1 has been typically referred to as a margin condition
or Bernstein condition in the Learning Theory literature, we will rather call it a local
curvature assumption, following Guédon and Vershynin (2016) and Chinot et al. (2018),
since this type of relation describes the behavior of the risk function locally around its
oracle. The main advantage of finding a local curvature function $G$ is that $r_G^*(\Delta)$ should
be easier to compute than $r^*(\Delta)$, and $r^*(\Delta) \le r_G^*(\Delta)$ because of the definition of $r_G^*(\Delta)$ and the inclusion
$\{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r_G^*(\Delta)\} \subset \{Z \in \mathcal{C} : G(Z^* - Z) \le r_G^*(\Delta)\}$ (thanks to Assumption 1).
We can therefore state the following corollary.














Corollary 1 We assume that the constraint $\mathcal{C}$ is star-shaped in $Z^*$ and that the "local
curvature" Assumption 1 holds for some $0 < \Delta < 1$. With probability at least $1 - \Delta$, it
holds true that

$$r_G^*(\Delta) \ge \langle \mathbb{E}A, Z^* - \hat{Z} \rangle \ge G(Z^* - \hat{Z}).$$
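A curvature relation of the kind required in Assumption 1 can be verified by hand in a very simple case. The sketch below is a toy check under hypothetical assumptions: the mean matrix is $\theta\, x^* (x^*)^\top$ and the candidates are rank-one sign matrices $x x^\top$. For such matrices each entry of $Z^* - Z$ is either $0$ or $\pm 2$ with the sign of the corresponding entry of $Z^*$, so the excess risk equals $\theta \|Z^* - Z\|_1$ exactly, which corresponds to a curvature function $G = \theta \|\cdot\|_1$ (margin parameter equal to one).

```python
import itertools

n, theta = 5, 0.5                      # hypothetical size and signal strength
x_star = (1, 1, 1, -1, -1)
Z_star = [[a * b for b in x_star] for a in x_star]
EA = [[theta * z for z in row] for row in Z_star]   # assumed mean matrix theta * x* x*^T

def excess_risk(Z):
    """<EA, Z* - Z>."""
    return sum(EA[i][j] * (Z_star[i][j] - Z[i][j]) for i in range(n) for j in range(n))

def l1_distance(Z):
    """Entrywise l1 norm of Z* - Z."""
    return sum(abs(Z_star[i][j] - Z[i][j]) for i in range(n) for j in range(n))

# Largest gap between the excess risk and G(Z* - Z) = theta * ||Z* - Z||_1 over the toy class.
max_gap = 0.0
for x in itertools.product((-1, 1), repeat=n):
    Z = [[a * b for b in x] for a in x]
    max_gap = max(max_gap, abs(excess_risk(Z) - theta * l1_distance(Z)))
print("max |excess risk - theta * l1 distance| =", max_gap)
```

In realistic models the relation is an inequality that holds only locally, and establishing it is one of the two main steps of the analysis.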


When it is possible to describe the local curvature of the excess risk around its oracle by
some $G$ function and when some estimate of $r_G^*(\Delta)$ can be obtained, Corollary 1 applies
and estimation results of $Z^*$ by $\hat{Z}$ (w.r.t. both the "excess risk" metric $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle$ and
the $G$ metric) follow. If not, either because understanding the local curvature of the excess
risk or the computation of $r_G^*(\Delta)$ is difficult, it is still possible to apply Theorem 1 with the
global VC approach, which boils down to simply upper bounding the fixed point $r^*(\Delta)$ used
in Theorem 1 by a global parameter that is a complexity measure of the entire set $\mathcal{C}$:

$$r^*(\Delta) \le \inf\Big( r > 0 : \mathbb{P}\Big[ \sup_{Z \in \mathcal{C}} \langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) r \Big] \ge 1 - \Delta \Big). \qquad (6)$$


Interestingly, if the latter last-resort approach is used then, following the approach from
Guédon and Vershynin (2016), Grothendieck's inequality Grothendieck (1953); Pisier (2012)
appears to be a powerful tool to upper bound the right-hand side of (6) in the case of the
community detection problem, such as in Guédon and Vershynin (2016), as well as in the
MAX-CUT problem. Of course, when it is possible to avoid this ultimate global approach, one
should do so because the local approach will always provide better results.

Finally, proving a "local curvature" property such as in Assumption 1 may be difficult
because it requires understanding the shape of the local subsets $\mathcal{C} \cap \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}, r > 0$.
It is however possible to simplify this assumption if getting estimation results
of $Z^*$ only w.r.t. the $G$ function (and not necessarily an upper bound on the excess risk
$\langle \mathbb{E}A, Z^* - Z \rangle$) is enough. In that case, Assumption 1 may be replaced by the following one.






































Assumption 2 For all $Z \in \mathcal{C}$, if $G(Z^* - Z) \le r_G^*(\Delta)$ then $\langle \mathbb{E}A, Z^* - Z \rangle \ge G(Z^* - Z)$.


Assumption 2 assumes a curvature of the excess risk function in a $G$ neighborhood of $Z^*$,
unlike Assumption 1, which grants this curvature in an "excess risk neighborhood". The
shape of a neighborhood defined by the $G$ function may be easier to understand (for instance
when $G$ is a norm, a neighborhood defined by $G$ is the ball of a norm centered at $Z^*$ with
radius $r_G^*(\Delta)$). In general, the latter assumption and Assumption 1 do not compare. In
the next result, we show that if Assumption 2 holds then $\hat{Z}$ can estimate $Z^*$ w.r.t. the $G$
function.


Theorem 2 We assume that the constraint $\mathcal{C}$ is star-shaped in $Z^*$ and that the "local
curvature" Assumption 2 holds for some $0 < \Delta < 1$. We assume that the $G$ function is
continuous, $G(0) = 0$ and $G(\lambda(Z^* - Z)) \le \lambda G(Z^* - Z)$ for all $\lambda \in [0, 1]$ and $Z \in \mathcal{C}$. With
probability at least $1 - \Delta$, it holds true that $G(Z^* - \hat{Z}) \le r_G^*(\Delta)$.


As a consequence, Theorem 1, Corollary 1 and Theorem 2 are the three tools at our
disposal to study the performance of SDP estimators depending on the depth of
understanding we have of the problem. The best approach is given by Theorem 1 when
it is possible to compute efficiently the complexity fixed point $r^*(\Delta)$. If the latter
approach is too complicated (likely because understanding the geometry of the local subset
$\mathcal{C} \cap \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}, r > 0$ may be difficult) then one may resort to finding a curvature
function $G$ of the excess risk locally around $Z^*$. In that case, both Corollary 1 and
Theorem 2 may apply depending on the hardness of finding a local curvature function $G$ on an
"excess risk neighborhood" (see Assumption 1) or a "G-neighborhood" (see Assumption 2).
Finally, if no local approach can be handled (likely because describing the curvature of the
excess risk in any neighborhood of $Z^*$ or controlling the maximal oscillations of the
empirical process $Z \mapsto \langle \mathbb{E}A - A, Z^* - Z \rangle$ locally are too difficult) then one may resort ultimately
to a global approach which follows from Theorem 1 as explained in (6). In the following,
we will use these tools for various problems.

















Results like Theorem 1, Corollary 1 and Theorem 2 appeared in many papers on ERM
in Learning Theory such as in Koltchinskii (2011); Bartlett and Mendelson (2006); Massart
(2007); Lecué and Mendelson (2013). In all these results, the typical loss functions, such as
the quadratic or logistic loss functions, were not linear, unlike the one we are using
here. From that point of view, our problem is easier and this can be seen from the simplicity
of the proofs of our three general results from this section. What is much more complicated
here than in other more classical problems in Learning Theory is the computation of the
fixed point because (i) the stochastic process $Z \mapsto \langle A - \mathbb{E}A, Z - Z^* \rangle$ may be far from
being a Gaussian process if the noise matrix $A - \mathbb{E}A$ is complicated and (ii) the local sets
$\{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}$ or $\{Z \in \mathcal{C} : G(Z^* - Z) \le r\}$ for $r > 0$ may be very hard to
describe in a simple way. Instrumental results are available in the literature to circumvent
this kind of problem; see Fei and Chen (2019b).






































3. Revisiting two results from the community detection literature Fei 
and Chen (2019b); Guédon and Vershynin (2016) 


The rapid growth of social networks on the Internet has led many statisticians and
computer scientists to focus their research on data coming from graphs. One important topic
that has attracted particular interest during the last decades is that of community detection
Fortunato (2010); Porter et al. (2009), where the goal is to recover mesoscopic structures
in a network, the so-called communities. A community consists of a group of nodes
that are relatively densely connected to each other, but sparsely connected to other dense 
groups present within the network. The motivation for this line of work stems not only from 
the fact that finding communities in a network is an interesting and challenging problem 
of its own, as it leads to understanding structural properties of networks, but community 
detection is also used as a data pre-processing step for other statistical inference tasks on 
large graphs, as it facilitates parallelization and allows one to distribute time consuming 
processes on several smaller subgraphs (i.e., the extracted communities). 

One challenging aspect of the community detection problem arises in the setting of 
sparse graphs. Many of the existing algorithms, which enjoy theoretical guarantees, do so 
in the relatively dense regime for the edge sampling probability, where the expected average 
degree is of the order $O(\log n)$. The problem becomes challenging in very sparse graphs
with bounded average degree. To this end, Guédon and Vershynin proposed a semidefinite 
relaxation for a discrete optimization problem Guédon and Vershynin (2016), an instance 
of which encompasses the community detection problem, and showed that it can recover 
a solution with any given relative accuracy even in the setting of very sparse graphs with 
average degree of order O(1). 

A subset of the existing literature for community detection and clustering relies on 
spectral methods, which consider the adjacency matrix associated to a graph, and employ 
its eigenvalues, and especially eigenvectors, in the analysis process or to propose efficient 
algorithms to solve the task at hand. Along these lines, Le et al. (2016) proposed a general
framework for optimizing a general function of the graph adjacency matrix over discrete
label assignments by projecting onto a low-dimensional subspace spanned by vectors that
approximate the top eigenvectors of the expected adjacency matrix. The authors consider
the problem of community detection with $k = 2$ communities, which they frame as an
instance of their proposed framework, combined with a regularization step that shifts each
entry in the adjacency matrix by a small constant $\tau$, which renders their methodology
applicable in the sparse regime as well.

In the remainder of this section, we focus on the community detection problem on 
random graphs under the general stochastic block model. We will mostly revisit the work 
in Guédon and Vershynin (2016) and Fei and Chen (2019b) from the perspective given 
by Theorem 1. Indeed, thanks to this theorem, it is possible to simplify the proof of Fei 
and Chen (2019b), by avoiding both the peeling argument and the use of the bound from 
Guédon and Vershynin (2016). 

We first recall the definition of the generalized stochastic block model (SBM). We
consider a set of vertices $V = \{1, \dots, n\}$, and assume it is partitioned into $K$ communities
$\mathcal{C}_1, \dots, \mathcal{C}_K$ of arbitrary sizes $|\mathcal{C}_1| = l_1, \dots, |\mathcal{C}_K| = l_K$.


Definition 3 For any pair of nodes $i, j \in V$, we write $i \sim j$ when $i$ and $j$ belong to the
same community (i.e., there exists $k \in \{1, \dots, K\}$ such that $i, j \in \mathcal{C}_k$), and we write
$i \nsim j$ if $i$ and $j$ do not belong to the same community.


For each pair $(i, j)$ of nodes from $V$, we draw an edge between $i$ and $j$ with a fixed
probability $p_{ij}$ independently from the other edges. We assume that there exist numbers $p$
and $q$ satisfying $0 < q < p < 1$, such that

$$\begin{cases} p_{ij} > p, & \text{if } i \sim j \text{ and } i \neq j, \\ p_{ij} < q, & \text{otherwise}. \end{cases} \qquad (7)$$


We denote by $A = (A_{ij})_{1 \le i, j \le n}$ the observed symmetric adjacency matrix, such that, for all 
$1 \le i < j \le n$, $A_{ij}$ is distributed according to a Bernoulli of parameter $p_{ij}$. The community 
structure of such a graph is captured by the membership matrix $\bar{Z} \in \mathbb{R}^{n \times n}$, defined by 
$\bar{Z}_{ij} = 1$ if $i \sim j$, and $\bar{Z}_{ij} = 0$ otherwise. The main goal in community detection is to 
reconstruct $\bar{Z}$ from the observation $A$. 
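
To make the model concrete, the following minimal sketch samples an adjacency matrix from a generalized SBM and builds the corresponding membership matrix. The helper names (`membership_matrix`, `sample_sbm`), the 0-based vertex labels, the zero diagonal and the toy values $0.8$ and $0.1$ are illustrative assumptions, not part of the paper.

```python
import random

def membership_matrix(communities, n):
    """Z_bar[i][j] = 1 if i and j lie in the same community (i ~ j), else 0."""
    label = [None] * n
    for k, members in enumerate(communities):
        for i in members:
            label[i] = k
    return [[1 if label[i] == label[j] else 0 for j in range(n)] for i in range(n)]

def sample_sbm(prob, rng):
    """Symmetric 0/1 matrix with independent A_ij ~ Bernoulli(prob[i][j]) for i < j.

    The diagonal is set to 0 here; the model only specifies the entries i < j.
    """
    n = len(prob)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = 1 if rng.random() < prob[i][j] else 0
    return A

# Hypothetical toy instance: two communities, edge probability 0.8 inside, 0.1 across.
communities = [[0, 1, 2], [3, 4]]
n = 5
Z_bar = membership_matrix(communities, n)
prob = [[0.8 if Z_bar[i][j] else 0.1 for j in range(n)] for i in range(n)]
A = sample_sbm(prob, random.Random(0))
```

Any matrix `prob` whose within-community entries exceed $p$ and whose other entries lie below $q$ fits the definition; constant values are used here only for brevity.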

Spectral methods for community detection are very popular in the literature Guédon and 
Vershynin (2016); Fei and Chen (2019b); Vershynin (2018); Blondel et al. (2008); Clauset 
et al. (2004). There are many ways to introduce such methods, one of which is via 
convex relaxations of certain graph cut problems aiming to minimize a modularity function 
such as the RatioCut Newman (2006). Such relaxations often lead to SDP estimators, such 
as the ones introduced in Section 1. 

Considering a random graph distributed according to the generalized stochastic block 
model, and its associated adjacency matrix $A$ (i.e., $A = A^\top$ and $A_{ij} \sim \mathrm{Bern}(p_{ij})$ for 
$1 \le i < j \le n$ and $p_{ij}$ as defined in (7)), we will estimate its membership matrix $\bar{Z}$ via the 
following SDP estimator 

$$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A, Z \rangle,$$

where $\mathcal{C} = \{Z \in \mathbb{R}^{n \times n} : Z \succeq 0, Z \ge 0, \operatorname{diag}(Z) \preceq I_n, \sum_{i,j=1}^n Z_{ij} \le \lambda\}$ and 
$\lambda = \sum_{i,j=1}^n \bar{Z}_{ij} = \sum_{k=1}^K |\mathcal{C}_k|^2$ denotes the number of nonzero elements in the membership 
matrix $\bar{Z}$. The motivation for this approach stems from the fact that the membership matrix 
$\bar{Z}$ is actually the oracle, i.e., $Z^* = \bar{Z}$ (see Lemma 7.1 in Guédon and Vershynin (2016) or 
Lemma 1 below), where 

$$Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A, Z \rangle.$$











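Following the strategy from Theorem 1 and from our point of view, the upper bound on 
$r^*(\Delta)$ from Guédon and Vershynin (2016) is the one that is based on the global approach, 
that is, without localization. Indeed, Guédon and Vershynin (2016) uses the observation 
that, for all $r > 0$, it holds true that 

$$\sup_{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle \overset{(a)}{\le} \sup_{Z \in \mathcal{C}} \langle A - \mathbb{E}A, Z - Z^* \rangle \overset{(b)}{\le} 2 K_G \|A - \mathbb{E}A\|_{\mathrm{cut}}, \tag{8}$$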





























where $\|\cdot\|_{\mathrm{cut}}$ is the cut-norm$^1$ and $K_G$ is the Grothendieck constant (Grothendieck’s 
inequality is used in (b), see Pisier (2012); Vershynin (2018)). Therefore, the localization around 
the oracle $Z^*$ by the excess risk “band” $B_r^* := \{Z : \langle \mathbb{E}A, Z^* - Z \rangle \le r\}$ is simply removed in 
inequality (a). As a consequence, the resulting statistical bound is based on the complexity 
of the entire class $\mathcal{C}$ whereas, in a localized approach, only the complexity of $\mathcal{C} \cap B_r^*$ 
matters. The next step in the proof of Guédon and Vershynin (2016) is a high-probability upper 
bound on $\|A - \mathbb{E}A\|_{\mathrm{cut}}$, which follows from Bernstein’s inequality and a union bound: since 
$\|A - \mathbb{E}A\|_{\mathrm{cut}} = \max_{x, y \in \{-1,1\}^n} \langle A - \mathbb{E}A, x y^\top \rangle$, for all $t > 0$, $\|A - \mathbb{E}A\|_{\mathrm{cut}} \le t n(n-1)/2$ 
with probability at least $1 - \exp\left(2n \log 2 - n(n-1)t^2/(16\bar{p} + 8t/3)\right)$, where 
$\bar{p} := \frac{2}{n(n-1)} \sum_{i<j} p_{ij}(1 - p_{ij})$. The resulting upper bound on the fixed point obtained 
in Guédon and Vershynin (2016) is 

$$r^*(\Delta) \le (8/3) K_G \left(2n \log(2) + \log(1/\Delta)\right). \tag{9}$$
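
The maximum over sign vectors used above to express the cut norm can be evaluated exactly on very small matrices, which is a convenient sanity check when experimenting with the global bound. The sketch below is a brute-force illustration only (cost $2^n$); the function name `sign_bilinear_max` and the toy matrix are assumptions made for this example.

```python
from itertools import product

def sign_bilinear_max(M):
    """Max over x, y in {-1, 1}^n of <M, x y^T> = sum_ij M_ij x_i y_j.

    For a fixed x, the best y_j is the sign of sum_i x_i M_ij, so it suffices to
    enumerate x and add up the absolute values of the resulting column sums.
    """
    n_rows, n_cols = len(M), len(M[0])
    best = float("-inf")
    for x in product((-1, 1), repeat=n_rows):
        value = sum(abs(sum(x[i] * M[i][j] for i in range(n_rows))) for j in range(n_cols))
        best = max(best, value)
    return best

# Toy deviation matrix standing in for A - EA.
M = [[0.5, -0.25], [-0.25, 0.5]]
worst_case = sign_bilinear_max(M)
```

The union bound in the text runs over the same $2^n \cdot 2^n$ pairs $(x, y)$, which is where the term $2n \log 2$ in the probability estimate comes from.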


Under the assumptions of Theorem 1 in Guédon and Vershynin (2016) (i.e., for 
some $\varepsilon \in (0,1)$, $n \ge 5 \cdot 10^4/\varepsilon^2$, $\max(p(1-p), q(1-q)) \ge 20/n$, $p = a/n > b/n = q$ and 
$(a-b)^2 \ge 2 \cdot 10^4 \varepsilon^{-2}(a+b)$), for $\Delta = e^3 5^{-n}$ we obtain (using the general result in Theorem 1), 
with probability at least $1 - \Delta$, the bound $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le r^*(\Delta) \le \varepsilon n^2 = \varepsilon \|Z^*\|_2^2$, which is 
the result from Theorem 1 in Guédon and Vershynin (2016). Finally, Guédon and Vershynin 
(2016) uses a (global) curvature property of the excess risk in its Lemma 7.2: 

















Lemma 1 (Lemma 7.2 in Guédon and Vershynin (2016)) For all $Z \in \mathcal{C}$, 
$\langle \mathbb{E}A, Z^* - Z \rangle \ge [(p - q)/2] \, \|Z^* - Z\|_1$. 











Therefore, a (global, that is for all $Z \in \mathcal{C}$) curvature assumption holds for a $G$ function 
which is here the $\ell_1^{n \times n}$ norm, a margin parameter $\kappa = 1$ and $\theta = (p - q)/2$ for the community 
detection problem. However, this curvature property is not used to compute a “better” fixed 
point parameter, but only to obtain an $\ell_1^{n \times n}$ estimation bound since 

$$\|\hat{Z} - Z^*\|_1 \le \frac{2}{p - q} \langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le \frac{16 K_G \left(2n \log(2) + \log(1/\Delta)\right)}{3(p - q)}.$$

1. The cut-norm $\|\cdot\|_{\mathrm{cut}}$ of a real matrix $A = (a_{ij})_{i \in R, j \in C}$ with a set of rows indexed by $R$ and a set of 
columns indexed by $C$, is the maximum, over all $I \subset R$ and $J \subset C$, of the quantity $|\sum_{i \in I, j \in J} a_{ij}|$. It is 
also the operator norm of $A$ from $\ell_\infty$ to $\ell_1$ and the “injective norm” in the original Grothendieck “résumé” 
Grothendieck (1956); Pisier (2012). 


The latter bound, together with the sin-Theta theorem, allows the authors of Guédon and 
Vershynin (2016) to obtain an estimation bound for the community membership vector $x^*$. 
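
To get a feel for the size of the global bounds, the following sketch evaluates the right-hand side of (9) and the $\ell_1$ bound that follows from it through Lemma 1. Since the exact value of the Grothendieck constant is unknown, Krivine's upper bound $\pi / (2 \log(1 + \sqrt{2}))$ is used as a stand-in; this substitution and the function names are assumptions of the example, not part of the paper.

```python
import math

# Krivine's upper bound on the real Grothendieck constant, used in place of K_G.
KRIVINE_BOUND = math.pi / (2 * math.log(1 + math.sqrt(2)))

def fixed_point_bound(n, conf_delta, k_g=KRIVINE_BOUND):
    """Right-hand side of (9): (8/3) K_G (2 n log 2 + log(1 / Delta))."""
    return (8 / 3) * k_g * (2 * n * math.log(2) + math.log(1 / conf_delta))

def l1_bound(n, conf_delta, p, q, k_g=KRIVINE_BOUND):
    """l1 bound on Z_hat - Z*: the bound (9) divided by the curvature constant (p - q) / 2."""
    return fixed_point_bound(n, conf_delta, k_g) / ((p - q) / 2)

# Hypothetical numbers: n = 100 nodes, Delta = 0.01, p = 0.6, q = 0.2.
r_star_upper = fixed_point_bound(100, 0.01)
l1_upper = l1_bound(100, 0.01, 0.6, 0.2)
```

Both quantities grow linearly in $n$, to be compared with the trivial bound $\|\hat{Z} - Z^*\|_1 \le n^2$.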

The approach from Fei and Chen (2019b) improves upon the one in Guédon and Vershynin 
(2016) because it uses a localization argument: the curvature property of the excess 
risk function from Lemma 1 is used to improve the upper bound in (9) obtained following a 
global approach. Indeed, Fei and Chen (2019b) obtain a high-probability upper bound on 
the quantity 

$$\sup_{Z \in \mathcal{C} : \|Z^* - Z\|_1 \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle,$$

depending on $r$. This leads to an exact reconstruction result in the “dense” case and 
exponentially decaying rates of convergence in the “sparse” case. This is a typical example where 
the localization argument shows its advantage over the global approach. The price to pay 
is usually a more technical proof for the local approach compared with the global one. 
However, the argument from Fei and Chen (2019b) also uses an unnecessary peeling argument 
together with an unnecessary a priori upper bound on $\|\hat{Z} - Z^*\|_1$ (which is actually the one 
from Guédon and Vershynin (2016)). It appears that this peeling argument and this a priori 
upper bound on $\|\hat{Z} - Z^*\|_1$ can be avoided thanks to our approach from Theorem 1. This 
improves the probability estimate and simplifies the proofs (since the result from Guédon 
and Vershynin (2016) is not required anymore, and neither is the peeling argument). For 
the signed clustering problem we consider below as an application of our main results, we will 
mostly adapt the probabilistic tools from Fei and Chen (2019b) (in the “dense” case) to the 
methodology associated with Theorem 1 (without these two unnecessary arguments). 


4. Contributions of the paper 


This section encompasses the main contributions of our paper for the three problems we 
study, namely signed clustering, angular synchronization, and MAX-CUT. 


4.1 Application to signed clustering 


Much of the clustering literature, including both spectral and non-spectral methods, has 
focused on unsigned graphs, where each edge carries a non-negative scalar weight that en- 
codes a measure of affinity (similarity, trust) between pairs of nodes. However, in numerous 
instances, the above-mentioned affinity takes negative values, and encodes a measure of 
dissimilarity or distrust. Such applications arise in social networks where user 
relationships denote trust-distrust or friendship-enmity, shopping bipartite networks which capture 
like-dislike relationships between users and products Banerjee et al. (2012), online news 
and review websites, such as Epinions and Slashdot, that allow users to approve or 
denounce others Leskovec et al. (2010a), and clustering financial or economic time series data 
Aghabozorgi et al. (2015). Such applications have spurred interest in the analysis of signed 
networks, which has recently become an increasingly important research topic Leskovec et al. 
(2010b), with relevant lines of work in the context of clustering signed networks including, in 
chronological order, Kunegis et al. (2010); Chiang et al. (2012); Cucuringu et al. (a, 2021). 
The latter work proposed regularized versions of signed clustering methods to handle sparse 
graphs — a regime where standard spectral methods are known to underperform. 

The second application of our proposed methodology is an extension of the community 
detection and clustering problem to the setting of signed graphs, where, for simplicity, we 
assume that an edge connecting two nodes can take either —1 or +1 values. 


4.1.1 A SIGNED STOCHASTIC BLOCK MODEL (SSBM) 


We focus on the problem of clustering $K$-weakly balanced graphs$^2$. We consider a signed 
stochastic block model (SSBM) similar to the one introduced in Cucuringu et al. (a), where 
we are given a graph $G$ with $n$ nodes $\{1, \ldots, n\}$ which are divided into $K$ communities, 
$\{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$, such that, in the noiseless setting, edges within each community are positive 
and edges between communities are negative. 
The only information available to the user is given by an $n \times n$ sparse adjacency matrix 
$A$ constructed as follows: $A$ is symmetric, with $A_{ii} = 1$ for all $i = 1, \ldots, n$, and for all 
$1 \le i < j \le n$, $A_{ij} = s_{ij}(2B_{ij} - 1)$ where 

$$B_{ij} \sim \begin{cases} \mathrm{Bern}(p) & \text{if } i \sim j, \\ \mathrm{Bern}(q) & \text{if } i \not\sim j, \end{cases} \qquad \text{and} \qquad s_{ij} \sim \mathrm{Bern}(\delta),$$

for some $0 \le q < 1/2 < p \le 1$ and $\delta \in (0, 1)$. Moreover, all the variables $B_{ij}, s_{ij}$ for 
$1 \le i < j \le n$ are independent. 
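
The construction of $A$ in the SSBM can be sketched as follows. The function name `sample_ssbm` and the encoding of communities as a list of integer labels are assumptions made for this illustration.

```python
import random

def sample_ssbm(labels, p, q, delta, rng):
    """Signed adjacency matrix with A_ii = 1 and A_ij = s_ij * (2 * B_ij - 1) for i < j.

    B_ij ~ Bern(p) inside a community and Bern(q) across communities decides the sign;
    s_ij ~ Bern(delta) decides whether the pair is observed at all (0 means censored).
    """
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            b = 1 if rng.random() < (p if same else q) else 0
            s = 1 if rng.random() < delta else 0
            A[i][j] = A[j][i] = s * (2 * b - 1)
    return A

# Hypothetical toy instance with two communities of size two.
labels = [0, 0, 1, 1]
A_signed = sample_ssbm(labels, p=0.8, q=0.3, delta=0.5, rng=random.Random(1))
```

Taking `p = 1` and `q = 0` with every pair observed reproduces the noiseless picture described above: $+1$ inside communities and $-1$ across them.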

We remark that this SSBM model is similar to the one considered in Cucuringu et al. 
(a), which was governed by two parameters, the sampling probability $\delta$ as above, and the 
noise level $\eta$, which may flip entries of the adjacency matrix. 

Our aim is to recover the community membership matrix or cluster matrix $\bar{Z} = (\bar{Z}_{ij})_{1 \le i, j \le n}$, 
with $\bar{Z}_{ij} = 1$ when $i \sim j$ and $\bar{Z}_{ij} = 0$ when $i \not\sim j$ using only the observed censored adjacency 
matrix $A$. 

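Our approach is similar in nature to the one used by spectral methods in community 
detection. We first observe that for $\alpha := \delta(p + q - 1)$ and $J = (1)_{n \times n}$ we have $\bar{Z} = Z^*$ 
where 

$$Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A - \alpha J, Z \rangle, \tag{10}$$

and $\mathcal{C} = \{Z \in \mathbb{R}^{n \times n} : Z \succeq 0, Z_{ij} \in [0, 1], Z_{ii} = 1, i = 1, \ldots, n\}$. The proof of (10) is recalled 
in Section B. 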
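
The role of the centering term can be checked numerically. The sketch below assumes that $\alpha$ is the midpoint $\delta(p + q - 1)$ of the two possible off-diagonal means $\delta(2p - 1)$ and $\delta(2q - 1)$ of $A_{ij}$; the function name `centered_means` is hypothetical.

```python
def centered_means(p, q, delta):
    """Off-diagonal entries of E[A] - alpha * J, within and across communities.

    E[A_ij] = E[s_ij] * E[2 * B_ij - 1] = delta * (2p - 1) if i ~ j, delta * (2q - 1) otherwise.
    alpha is assumed to be the midpoint delta * (p + q - 1) of these two values.
    """
    alpha = delta * (p + q - 1)
    within = delta * (2 * p - 1) - alpha   # equals  delta * (p - q) > 0
    across = delta * (2 * q - 1) - alpha   # equals -delta * (p - q) < 0
    return within, across

within, across = centered_means(p=0.9, q=0.2, delta=0.5)
```

With entries of $Z$ constrained to $[0, 1]$, a linear objective with positive weights inside communities and negative weights across them is maximized entrywise by the 0/1 pattern of the membership matrix, which is positive semidefinite and hence feasible. This is only a heuristic reading; the actual proof is the one referred to in Section B.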




















Since we do not know $\mathbb{E}A$ and $\alpha$, we should estimate both of them. We will estimate 
$\mathbb{E}A$ with $A$ but, for simplicity, we will assume that $\alpha$ is known. The resulting estimator of 
the cluster matrix $\bar{Z}$ is 

$$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A - \alpha J, Z \rangle, \tag{11}$$

which is indeed an SDP estimator, and therefore Theorem 1 (or Corollary 1 and Theorem 2) 
may be used to obtain statistical bounds for the estimation of $Z^*$ from (10) by $\hat{Z}$. 

2. A signed graph is $K$-weakly balanced if and only if all the edges are positive, or the nodes can be 
partitioned into $K \in \mathbb{N}$ disjoint sets such that positive edges exist only within clusters, and negative 
edges are only present across clusters Davis (1967). 

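We will use the following notations: $s := \delta(p - q)^2$, $\theta := \delta(p - q)$, 
$\rho := \delta \max\{1 - \delta(2p - 1)^2, 1 - \delta(2q - 1)^2\}$, $\nu := \max\{2p - 1, 1 - 2q\}$, $[m] := \{1, \ldots, m\}$ for all 
$m \in \mathbb{N}$, $l_k := |\mathcal{C}_k|$ for all $k \in [K]$, $\mathcal{C}^+ := \bigcup_{k \in [K]} (\mathcal{C}_k \times \mathcal{C}_k)$ and 
$\mathcal{C}^- := \bigcup_{k \ne k'} (\mathcal{C}_k \times \mathcal{C}_{k'})$. We also use the notation $c_0, c_1, \ldots$ to denote absolute 
constants whose values may change from one line to another. 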
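
For numerical experiments it is convenient to compute these model constants from $(p, q, \delta)$. The helper below is a minimal sketch (the name `ssbm_constants` and the dictionary output are assumptions).

```python
def ssbm_constants(p, q, delta):
    """Constants s, theta, rho, nu of the signed stochastic block model."""
    s = delta * (p - q) ** 2
    theta = delta * (p - q)
    # Since A_ij^2 = s_ij, Var(A_ij) = delta - delta^2 (2p - 1)^2 inside a community
    # (with q in place of p across communities), so rho is the largest entrywise variance.
    rho = delta * max(1 - delta * (2 * p - 1) ** 2, 1 - delta * (2 * q - 1) ** 2)
    nu = max(2 * p - 1, 1 - 2 * q)
    return {"s": s, "theta": theta, "rho": rho, "nu": nu}

constants = ssbm_constants(p=0.75, q=0.25, delta=0.5)
```

Here $\theta$ is the gap between $\mathbb{E}A_{ij}$ and the midpoint $\delta(p + q - 1)$ of the two community means, so $s/\delta = (p - q)^2$ only depends on the sign probabilities.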





4.1.2 MAIN RESULT FOR THE ESTIMATION OF THE CLUSTER MATRIX IN SIGNED 
CLUSTERING 


Our main result concerns the reconstruction of the $K$ communities from the observation of 
the matrix $A$. In order to avoid solutions with some communities of degenerate size (too 
small or too large), we consider the following assumption. 


Assumption 3 Up to constants, the elements of the partition $\mathcal{C}_1 \sqcup \cdots \sqcup \mathcal{C}_K$ of $\{1, \ldots, n\}$ are 
of the same size: there are absolute constants $c_0, c_1 > 0$ such that for any $k \in [K]$, 
$n/(c_1 K) \le |\mathcal{C}_k| = l_k \le c_0 n/K$. 


We are now ready to state the main result on the estimation of the cluster matrix $Z^*$ from 
(10) by the SDP estimator $\hat{Z}$ from (11). 


Theorem 3 There is an absolute positive constant $c_0$ such that the following holds. Grant 
Assumption 3. Assume that 

$$n \nu \delta \ge \log n, \tag{12}$$

$$s n \ge c_0 K^2 \nu \tag{13}$$


K log(2eK i 
ang los(2eKn) Z nax (Z, w) (14) 
n p 32 
Then, with probability at least $1 - \exp(-\delta \nu n) - 3/(2eKn)$, exact recovery holds true, i.e., 
$\hat{Z} = Z^*$. We recall the constants defined above: $s := \delta(p - q)^2$, $\theta := \delta(p - q)$, 
$\rho := \delta \max\{1 - \delta(2p - 1)^2, 1 - \delta(2q - 1)^2\}$, $\nu := \max\{2p - 1, 1 - 2q\}$. 








Therefore, we have exact reconstruction in the dense case (that is under assumption (12)), 
which stems from condition (14). The latter condition is in the same spirit as the one in 
Theorem 1 of Fei and Chen (2019b): it measures the SNR (signal-to-noise ratio) of the 
model, which captures the hardness of the SSBM. As mentioned in Fei and Chen (2019b), 
it is related to the Kesten-Stigum threshold Mossel et al. (2015). The last condition (14) 
basically requires that the number of clusters $K$ is at most $n/\log n$. If this condition is 
dropped, then we no longer have exact reconstruction but only a certified exponential 
rate of convergence: there exists a universal constant $c_2$ such that, with probability 
at least $1 - \exp(-\delta \nu n) - 3/(2eKn)$, we have that 


A 2en? sn 
Ze 2| < 1 
| 1> a0 exp ( a 15)
A proof of (15) can be found in Section 5. 

This shows that in the dense case, exact reconstruction is possible when $K \lesssim n/\log n$ 
and, otherwise, when $K \gtrsim n/\log n$, we only have a control of the estimation error with an 
exponential convergence rate. 

We then obtain results of the same nature as in Fei and Chen (2019b), or in the more 
recent paper Xu et al. (2020). In those two articles, the authors show the existence of a 
phase transition, with exact recovery in the regime $K \lesssim n/\log(n)$, and exponential rate 
with exponent $\sim -sn/K$ otherwise, where $s$ is some measurement of the signal/noise ratio 
of the problem. Note that the estimation bound is given with respect to the $\ell_1^{n \times n}$ norm. 
This is not a surprise since it is the behaviour of the excess risk over $\mathcal{C}$ around $Z^*$. 

In some recent works Fei and Chen (2019a, 2020), the authors were able to obtain 
sharp constants in the rate (15) for the Synchronization model, the Censored Block Model 
as well as the Stochastic Block Model. Their proof relies on the construction of a dual 
certificate and goes through the study of the dual problem. The proof technique behind 
Theorem 3 is of a different nature: it is a direct ’primal’ approach, and it is not clear 
how to relate the two approaches. Two similar approaches were developed in the 
compressed sensing and matrix completion problems (to name a few), where the ’primal’ 
approach was based on the Null Space property, the RIP or some neighborliness property 
Foucart and Rauhut (2013); Chafaï et al. (2012) and, at the same time and for the same 
problems, a ’dual’ approach relying on the construction of a dual or approximate dual 
certificate was performed Candès and Tao (2010); Gross (2011). To the best of our 
knowledge, no clear connection has been made between the two approaches. It would 
nevertheless be interesting to have a clear picture of the two types of approaches, and to see 
whether they are actually the same or stem from a more general approach. 


4.2 Application to angular group synchronization 


In this section, we introduce the group synchronization problem as well as a stochastic 
model for this problem. We consider an SDP relaxation of the original problem (which is 
exact) and construct the associated SDP estimator such as in (3). 

The angular synchronization problem consists of estimating $n$ unknown angles $\theta_1, \ldots, \theta_n$ 
(up to a global shift angle) given a noisy subset of their pairwise offsets $(\theta_i - \theta_j)[2\pi]$, where 
$[2\pi]$ is the modulo $2\pi$ operation. The pairwise measurements can be realized as the edge 
set of a graph $G$, typically modeled as an Erdős-Rényi random graph Singer (2011). 

The aim of this section is to show that the angular synchronization problem can be 
analyzed using our methodology. In order to keep the presentation as simple as possible, 
we assume that all pairwise offsets are observed up to some Gaussian noise: we are given 
$\delta_{ij} = (\theta_i - \theta_j + \sigma g_{ij})[2\pi]$ for all $1 \le i < j \le n$ where $(g_{ij} : 1 \le i < j \le n)$ are $n(n-1)/2$ 
i.i.d. standard Gaussian variables and $\sigma > 0$ is the noise standard deviation. We may rewrite the 
problem as follows: we observe an $n \times n$ complex matrix $A$ defined by 

$$A = S \circ [x^* (\bar{x}^*)^\top] \quad \text{where } S = (S_{ij})_{n \times n}, \quad S_{ij} = \begin{cases} e^{\iota \sigma g_{ij}} & \text{if } i < j, \\ 1 & \text{if } i = j, \\ e^{-\iota \sigma g_{ji}} & \text{if } i > j, \end{cases} \tag{16}$$

$\iota$ denotes the imaginary number such that $\iota^2 = -1$, $x^* = (x_i^*)_{i=1}^n \in \mathbb{C}^n$, $x_i^* = e^{\iota \theta_i}$, $i = 1, \ldots, n$, 
$\bar{x}$ denotes the conjugate vector of $x$ and $S \circ [x^*(\bar{x}^*)^\top]$ is the element-wise product 
$(S_{ij} x_i^* \bar{x}_j^*)_{n \times n}$. In particular, $S$ is a Hermitian matrix (i.e. $\bar{S}^\top = S$) and $\mathbb{E}S_{ij} = \exp(-\sigma^2/2)$ 
for $i \ne j$ and $\mathbb{E}S_{ii} = 1$. We want to estimate $(\theta_1, \ldots, \theta_n)$ (up to a global shift) from 
the matrix of data $A$. 
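
The data matrix of the multiplicative noise model can be simulated directly from the angles. The following is a minimal sketch; the function name `sample_sync_matrix` and the toy angles are assumptions of the example.

```python
import cmath
import random

def sample_sync_matrix(thetas, sigma, rng):
    """Hermitian matrix A with A_ii = 1 and A_ij = exp(i (theta_i - theta_j + sigma g_ij)), i < j.

    The g_ij are i.i.d. standard Gaussian; the entries below the diagonal are the
    complex conjugates of the entries above it.
    """
    n = len(thetas)
    A = [[1 + 0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            g = rng.gauss(0.0, 1.0)
            A[i][j] = cmath.exp(1j * (thetas[i] - thetas[j] + sigma * g))
            A[j][i] = A[i][j].conjugate()
    return A

# Hypothetical toy instance with three angles and moderate noise.
A_sync = sample_sync_matrix([0.0, 1.0, 2.5], sigma=0.3, rng=random.Random(1))
```

Every entry has modulus one whatever the noise level, which is what distinguishes this model from the additive one: the noise only rotates the entries of $x^*(\bar{x}^*)^\top$.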

Unlike the statistical model introduced in Bandeira et al. (2016), the noise here is 
multiplicative in $A$. From a physical point of view, it makes more sense to consider an additive 
noise on the offsets, i.e., we observe $(\theta_i - \theta_j + \sigma g_{ij})[2\pi]$. The noise becomes multiplicative 
by passing to the exponential. However, to compare our methodology with the one from 
Bandeira et al. (2016), we also consider the model therein (that is, an additive noise on 
the matrix $Z^* = x^*(\bar{x}^*)^\top$ instead of the multiplicative noise in $A$). We recover in this latter 
model results similar to the ones in Bandeira et al. (2016). For the moment, we consider the 
multiplicative noise and the data matrix $A$ as introduced above in (16); we will turn to the 
additive noise model from Bandeira et al. (2016) at the very end of this section in a remark. 

The first step is to find a (vectorial) optimization problem whose solutions are given 
by $(\theta_i)_{i=1}^n$ (up to a global angle shift) or some bijective function of it. Estimating $(\theta_i)_{i=1}^n$ up 
to a global angle shift is equivalent to estimating the vector $x^* = (e^{\iota \theta_i})_{i=1}^n$. The latter is, up 
to a global rotation of its coordinates, the unique solution of the following maximization 
problem 

$$\operatorname*{argmax}_{x \in \mathbb{C}^n : |x_i| = 1} \{\bar{x}^\top \mathbb{E}A \, x\} = \{(e^{\iota(\theta_i + \theta_0)})_{i=1}^n : \theta_0 \in [0, 2\pi)\}. \tag{17}$$










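A proof of (17) is given in Section 6. Let us now rewrite (17) as an SDP problem. For all 
$x \in \mathbb{C}^n$, we have $\bar{x}^\top \mathbb{E}A \, x = \operatorname{tr}(\mathbb{E}A \, X) = \langle \mathbb{E}A, X \rangle$ where $X = x \bar{x}^\top$ and 
$\{Z \in \mathbb{C}^{n \times n} : Z = x \bar{x}^\top, |x_i| = 1\} = \{Z \in \mathbb{H}_n : Z \succeq 0, \operatorname{diag}(Z) = 1_n, \operatorname{rank}(Z) = 1\}$ where $\mathbb{H}_n$ is 
the set of all $n \times n$ Hermitian matrices and $1_n \in \mathbb{C}^n$ is the vector with all coordinates equal to 1. 
It is therefore straightforward to construct an SDP relaxation of (17) by dropping the rank 
constraint. It appears that this relaxation is exact since, for $\mathcal{C} = \{Z \in \mathbb{H}_n : Z \succeq 0, \operatorname{diag}(Z) = 1_n\}$, 

$$\operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A, Z \rangle = \{Z^*\}, \tag{18}$$

where $Z^* = x^*(\bar{x}^*)^\top$. A proof of (18) can be found in Section 6. Finally, as we only observe 
$A$, we consider the following SDP estimator of $Z^*$ 

$$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A, Z \rangle. \tag{19}$$

In the next section, we use the strategy from Corollary 1 to obtain statistical guarantees 
for the estimation of $Z^*$ by $\hat{Z}$. 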
Intuitively, the above maximization problem (18) attempts to preserve the given angle 
offsets as best as possible, by aiming to maximize the following objective function 

$$\operatorname*{argmax}_{\theta_1, \ldots, \theta_n \in [0, 2\pi)} \sum_{i,j=1}^n e^{-\iota \theta_i} A_{ij} e^{\iota \theta_j}, \tag{20}$$

where the objective function value is incremented by $+1$ whenever an assignment of angles 
$\theta_i$ and $\theta_j$ perfectly satisfies the given edge constraint $\delta_{ij} = (\theta_i - \theta_j)[2\pi]$ (i.e., for a clean 
edge for which $\sigma = 0$), while the contribution of an incorrect assignment (i.e., of a very 
noisy edge) will be almost uniformly distributed on the unit circle in the complex plane. 
Due to the non-convexity of the optimization in (20), it is difficult to solve computationally Zhang 
and Huang (2006); one way to overcome this problem is to consider the SDP relaxation 
from (18) such as in (19) but it is also possible to consider a spectral relaxation such as the 
one proposed by Singer (2011) which replaces the individual constraints that all $z_i$’s should 
have unit magnitude by the much weaker single constraint $\sum_{i=1}^n |z_i|^2 = n$, leading to 

$$\operatorname*{argmax}_{z_1, \ldots, z_n \in \mathbb{C} ; \ \sum_{i=1}^n |z_i|^2 = n} \ \sum_{i,j=1}^n \bar{z}_i A_{ij} z_j. \tag{21}$$


The solution to the resulting maximization problem is simply given by a top eigenvector 
of the Hermitian matrix $A$, followed by a normalization step. We remark that the main 
advantage of the SDP relaxation (18) is that it explicitly imposes the unit magnitude 
constraint for $e^{\iota \theta_i}$, which we cannot otherwise enforce in the spectral relaxation solved via 
the eigenvector method in (21) (at the end of the day, our estimator $\hat{x}$ from Corollary 2 below 
is a top eigenvector which may not satisfy the unit magnitude constraint). The above SDP 
program (18) is very similar to the well-known Goemans-Williamson SDP relaxation for 
the seminal MAX-CUT problem of finding the maximal cut of a graph (the MAX-CUT 
problem is one of the four applications considered in this work, see Section 4.3 below), with 
the main difference being that here we optimize over the cone of complex-valued Hermitian 
positive semidefinite matrices, not just real symmetric matrices. 
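
The spectral relaxation (21) can be sketched with a plain power iteration followed by an entrywise normalization. This is an illustration rather than the procedure analyzed in the paper: it assumes that the eigenvalue of largest modulus of $A$ is its top eigenvalue, that the all-ones starting vector is not orthogonal to the top eigenvector, and that no coordinate of the result vanishes. The function names are hypothetical.

```python
import cmath

def top_eigenvector(A, iters=200):
    """Power iteration on a Hermitian matrix; the output has Euclidean norm sqrt(n)."""
    n = len(A)
    v = [1 + 0j] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(c) ** 2 for c in w) ** 0.5
        v = [c * n ** 0.5 / norm for c in w]
    return v

def to_unit_modulus(v):
    """Entrywise normalization step: project each coordinate onto the unit circle."""
    return [c / abs(c) for c in v]

# Noiseless toy instance (sigma = 0): A = x* conj(x*)^T, a rank-one Hermitian matrix.
thetas = [0.3, 1.1, 2.0, 4.0]
x_star = [cmath.exp(1j * t) for t in thetas]
A_clean = [[a * b.conjugate() for b in x_star] for a in x_star]
x_spec = to_unit_modulus(top_eigenvector(A_clean))
```

In the noiseless case the output coincides with $x^*$ up to a global phase, so relative phases such as `x_spec[i] * x_spec[0].conjugate()` match those of `x_star`.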


4.2.1 MAIN RESULTS FOR PHASE RECOVERY IN THE SYNCHRONIZATION PROBLEM (IN 
THE MULTIPLICATIVE NOISE MODEL) 


Our main result concerns the estimation of the matrix of offsets $Z^* = x^*(\bar{x}^*)^\top$ from the 
observation of the matrix $A$. This result is then used to estimate (up to a global phase 
shift) the angular vector $x^* = (e^{\iota \theta_i})_{i=1}^n$. Our first result follows from Corollary 1. 


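Theorem 4 Let $0 < \varepsilon < 1$. If $\sigma \le \sqrt{\log(\varepsilon n^4)}$ then, with probability at least 
$1 - \exp(-\varepsilon \sigma^4 n(n-1)/2)$, it holds true that 

$$(e^{-\sigma^2/2}/2) \, \|Z^* - \hat{Z}\|_2^2 \le \langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le (128/6) \sqrt{\varepsilon} \, \sigma^4 n(n-1). \tag{22}$$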


Once we have an estimator $\hat{Z}$ for the oracle $Z^*$, we can extract an estimator $\hat{x}$ for the 
vector of phases $x^*$ by considering a top eigenvector (i.e. an eigenvector associated with 
the largest eigenvalue) of $\hat{Z}$. It is then possible to quantify the estimation properties of $x^*$ 
by $\hat{x}$ using a sin-Theta theorem and Theorem 4. 


Corollary 2 Let $\hat{x}$ be a top eigenvector of $\hat{Z}$ with Euclidean norm $\|\hat{x}\|_2 = \sqrt{n}$. Let $0 < \varepsilon < 1$ 
and assume that $\sigma \le \sqrt{\log(\varepsilon n^4)}$. There exists a universal constant $c_0 > 0$ 
(which is the constant in the Davis-Kahan theorem for Hermitian matrices) such that, with 
probability at least $1 - \exp(-\varepsilon \sigma^4 n(n-1)/2)$, it holds true that 

$$\min_{z \in \mathbb{C} : |z| = 1} \|\hat{x} - z x^*\|_2 \le 8 c_0 \sqrt{2/3} \, \varepsilon^{1/4} e^{\sigma^2/4} \sigma^2 \sqrt{n}. \tag{23}$$

It follows from Corollary 2 that we can estimate $x^*$ (up to a global rotation $z \in \mathbb{C}$ : 
$|z| = 1$) with an $\ell_2$-estimation error of the order of $\sigma^2 \sqrt{n}$ with exponential deviations. Given 
that $\|x^*\|_2 = \sqrt{n}$, this means that a constant proportion of the entries are well estimated 
when $\varepsilon$ is taken like a constant. For a value of $\varepsilon \sim 1/n^2$, the rate of estimation is like 
$\sigma^2$; we therefore get a much better estimation of $x^*$ but only with constant probability. 
It is important to recall that $\hat{Z}$ and $\hat{x}$ can both be efficiently computed by solving an SDP 
problem and then by considering a top eigenvector of its solution (for instance, using the 
power method). 
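
The error criterion used in Corollary 2, a minimum over global phases, has a closed form that is handy in simulations: expanding the square shows that the optimal phase aligns $z$ with the inner product of the two vectors. The sketch below implements this identity; the function name `aligned_error` is an assumption.

```python
import cmath

def aligned_error(x_hat, x_star):
    """Return the minimum over |z| = 1 of the Euclidean norm of x_hat - z * x_star.

    Expanding the square gives
    ||x_hat||^2 + ||x_star||^2 - 2 Re(conj(z) * sum_i x_hat_i conj(x_star_i)),
    and the real part is at most the modulus of the inner product, with equality
    when z has the same phase as that inner product.
    """
    inner = sum(a * b.conjugate() for a, b in zip(x_hat, x_star))
    squared = (sum(abs(a) ** 2 for a in x_hat)
               + sum(abs(b) ** 2 for b in x_star)
               - 2 * abs(inner))
    return max(squared, 0.0) ** 0.5  # guard against tiny negative rounding errors

# A globally rotated copy of x_star is at distance 0 from it in this criterion.
x_true = [cmath.exp(1j * t) for t in (0.1, 0.9, 2.2)]
x_rot = [cmath.exp(0.7j) * c for c in x_true]
err_rotated = aligned_error(x_rot, x_true)
err_orthogonal = aligned_error([1, 1], [1, -1])
```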

We finish the section on the angular group synchronization with the additive model as 
considered in Bandeira et al. (2016). Our aim is still to put forward our methodology and 
to show that it has a wide spectrum of applications and that, in particular, it covers also 
the model introduced in Bandeira et al. (2016). 


4.2.2 THE ANGULAR GROUP SYNCHRONIZATION MODEL WITH ADDITIVE NOISE FROM 
BANDEIRA ET AL. (2016) 


As mentioned above, we chose to study a multiplicative noise model in $A$ since it makes more 
sense from a physical point of view to have an additive noise on the offsets $\delta_{ij} = (\theta_i - \theta_j)[2\pi]$ 
(this additive noise becoming multiplicative by passing to the exponential in $A$). However, in 
Bandeira et al. (2016), the authors considered a model with additive noise on $Z^* = x^*(\bar{x}^*)^\top$. 
In this “additive” model, we observe $C = Z^* + \sigma W$, where $W$ is a complex Wigner matrix 
and $\sigma > 0$ is the noise level. The MLE $\hat{x}$ is a solution to the problem 

$$\hat{x} \in \operatorname*{argmax}_{x \in \mathbb{C}^n, \ |x_i| = 1 \ \forall i} \bar{x}^\top C x, \tag{24}$$

which can be hard to compute in practice. Using the same approach as above, an SDP 
relaxation can be obtained by removing a rank one constraint, yielding the SDP estimator 

$$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle C, Z \rangle, \tag{25}$$

where $\mathcal{C} := \{Z \in \mathbb{H}_n : \operatorname{diag}(Z) = 1_n \text{ and } Z \succeq 0\}$. Statistical properties of $\hat{Z}$ have been 
obtained in Bandeira et al. (2016). We recall this result now. 


Theorem 5 (Theorem 2.1 in Bandeira et al. (2016)) Let $\hat{x}$ be a solution of (24). Then, 
with probability at least $1 - O(n^{-4})$, $\min_{z \in \mathbb{C} : |z| = 1} \|z \hat{x} - x^*\|_2 \le 12\sigma$. Moreover, if 
$\sigma \le (1/18) n^{1/4}$, then (25) has a unique solution which is the rank one matrix $\hat{x} \bar{\hat{x}}^\top$. 


Our methodology (here Corollary 1 is applied) may also be used to handle the “additive” 
noise model from Bandeira et al. (2016). We consider the same SDP estimator $\hat{Z}$ as defined 
in (25) and we obtain the following result. 


Theorem 6 Let $\tilde{x}$ be a top eigenvector of $\hat{Z}$. With probability at least $1 - 5 \exp(-n/2)$, 

$$\min_{z \in \mathbb{C} : |z| = 1} \|z \tilde{x} - x^*\|_2 \le 40 c_0 \sigma,$$

where $c_0$ is the constant appearing in the Davis-Kahan theorem for Hermitian matrices. 

Compared with Theorem 2.1 from Bandeira et al. (2016), the estimation rate that we 
get for the estimator $\tilde{x}$ is the same (up to an absolute constant) as the one obtained for the 
MLE $\hat{x}$ in Theorem 5: it is of the order of $\sigma$. Note however that our result for $\tilde{x}$ holds 
without any restriction on the noise level $\sigma$, whereas in Theorem 5 one needs $\sigma \le (1/18) n^{1/4}$ 
to get this result for $\tilde{x}$. Note also that our result holds with exponentially large probability, 
whereas the one in Theorem 5 holds only with polynomial deviation. From a statistical 
perspective, our result improves the one from Bandeira et al. (2016). However, the main 
interest of Theorem 5 is not in the statistical performance of $\hat{x}$ but in the sharpness of 
the SDP relaxation, since it shows that the SDP relaxation (25) is actually exact when 
$\sigma \le (1/18) n^{1/4}$. This is a result that we do not have and that our methodology cannot 
obtain, since it is designed to prove only an estimation bound. But from a statistical point 
of view, it does not improve the estimation rate to know that the SDP relaxation is exact: 
our result shows that the SDP relaxation performs as well as the MLE, without proving that 
they are the same (up to global phase). 

A proof of Theorem 6 is given in Annex C. We actually provide three estimation bounds 
for $\tilde{x}$ in this proof. We are doing so because our aim is to show how a general methodology 
works in various examples. This methodology relies on the computation of a fixed point 
($r_G^*(\Delta)$ from (5) here). Hence, understanding how to bound this fixed point is part of the 
objective of this paper. We therefore use the angular group synchronization problem with 
additive noise as a playground to show three different ways to upper bound such a fixed 
point. Using the three computations, we actually obtain the following three upper bounds 

$$\min_{z \in \mathbb{C} : |z| = 1} \|z \tilde{x} - x^*\|_2 \le \begin{cases} 8 c_0 \sigma \sqrt{n} & \text{with probability at least } 1 - \exp(-n^2/2), \\ c_0 \sqrt{36 K_G^{\mathbb{C}} \sigma} \, n^{1/4} & \text{with probability at least } 1 - \exp(-n/2), \\ 40 c_0 \sigma & \text{with probability at least } 1 - 5 \exp(-n/2), \end{cases}$$

where $c_0$ is the constant appearing in the Davis-Kahan theorem for Hermitian matrices and 
$K_G^{\mathbb{C}}$ is the Grothendieck constant in the complex case. Each of the three bounds above follows 
from different upper bounds on the complexity fixed point $r_G^*$; for instance, the second one 
follows from the “global” approach, and the third one follows from a decomposition similar 
to the one from Fei and Chen (2019b) and is the one we used in Theorem 6. 


4.3 Application to the MAX-CUT problem 


Let $A^0 \in \{0,1\}^{n \times n}$ be the (symmetric) adjacency matrix of an undirected graph $G = (V, E^0)$, 
where $V := \{1, \ldots, n\}$ is the set of the vertices and the set of edges is 
$E^0 := E \cup E^\top \cup \{(i,i) : A^0_{ii} = 1\}$ where $E := \{(i,j) \in V^2 : i < j \text{ and } A^0_{ij} = 1\}$ and 
$E^\top = \{(j,i) : (i,j) \in E\}$. We assume that $G$ has no self loop so that $A^0_{ii} = 0$ for all $i \in V$. 
A cut of $G$ is any subset $S$ of vertices in $V$. For a cut $S \subset V$, we define its weight by 
$\mathrm{cut}(G, S) := (1/2) \sum_{(i,j) \in S \times \bar{S}} A^0_{ij}$, that is the number of edges in $E$ between $S$ and its 
complement $\bar{S} = V \setminus S$. The MAX-CUT problem is to find the cut with maximal weight 

$$S^* \in \operatorname*{argmax}_{S \subset V} \mathrm{cut}(G, S). \tag{26}$$


The MAX-CUT problem is an NP-complete problem, but Goemans and Williamson 
(1995) constructed a 0.878 approximating solution via an SDP relaxation. Indeed, one can 
write the MAX-CUT problem in the following way. For a cut $S \subset V$, we define the 
membership vector $x \in \{-1, 1\}^n$ associated with $S$ by setting $x_i := 1$ if $i \in S$ and $x_i := -1$ 
if $i \notin S$ for all $i \in V$. We have $\mathrm{cut}(G, S) = (1/4) \sum_{i,j=1}^n A^0_{ij}(1 - x_i x_j) =: \mathrm{cut}(G, x)$ and so 
solving (26) is equivalent to solving 

$$x^* \in \operatorname*{argmax}_{x \in \{-1, 1\}^n} \mathrm{cut}(G, x). \tag{27}$$

Since $(x_i x_j)_{i,j} = x x^\top$, the latter problem is also equivalent to solving 

$$\max\left( \frac{1}{4} \sum_{i,j=1}^n A^0_{ij}(1 - Z_{ij}) : \operatorname{rank}(Z) = 1, \ Z \succeq 0, \ Z_{ii} = 1 \right) \tag{28}$$

which admits an SDP relaxation by removing the rank-1 constraint. This yields the following 
SDP relaxation problem of MAX-CUT from Goemans and Williamson (1995) 

$$Z^* \in \operatorname*{argmin}_{Z \in \mathcal{C}} \langle A^0, Z \rangle, \tag{29}$$

where $\mathcal{C} := \{Z \in \mathbb{R}^{n \times n} : Z \succeq 0, Z_{ii} = 1, \forall i = 1, \ldots, n\}$. 
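
The sign-vector formulation (27) is easy to check on small graphs by exhaustive search. The sketch below is only meant as an illustration of the formula $\mathrm{cut}(G, x) = (1/4) \sum_{i,j} A^0_{ij}(1 - x_i x_j)$; the function names are assumptions and the search costs $2^n$ evaluations.

```python
from itertools import product

def cut_weight(A0, x):
    """cut(G, x) = (1/4) * sum_ij A0_ij * (1 - x_i * x_j) for a sign vector x.

    Each edge joining the two sides is counted twice in the double sum with a
    factor (1 - x_i x_j) = 2, hence the normalization by 4.
    """
    n = len(A0)
    return sum(A0[i][j] * (1 - x[i] * x[j]) for i in range(n) for j in range(n)) / 4

def max_cut_brute_force(A0):
    """Exhaustive search over {-1, 1}^n; only feasible for very small graphs."""
    n = len(A0)
    best_x = max(product((-1, 1), repeat=n), key=lambda x: cut_weight(A0, x))
    return cut_weight(A0, best_x), best_x

# Toy graphs: a triangle and a 4-cycle (the latter is bipartite).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
square = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
triangle_value, _ = max_cut_brute_force(triangle)
square_value, square_cut = max_cut_brute_force(square)
```

On the triangle at most two of the three edges can be cut, whereas on the bipartite 4-cycle all four edges are cut by separating the two sides of the bipartition.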

Unlike the other examples from the previous sections, the SDP relaxation in (29) is 
not exact, except for bipartite graphs; see Khot and Naor (2009); Gärtner and Matoušek 
(2012) for more details. Nevertheless, thanks to the approximation result from Goemans 
and Williamson (1995), we can use our methodology to estimate Z* and then deduce an 
approximate optimal cut. The MAX-CUT problem is therefore a good setup for us to test 
our methodology in a context where the SDP relaxation is not exact, but still widely used in 
practice. Thus the type of question we want to answer here is: what can we say in a setup 
where only partial or noisy information is available on E[A], and when the SDP relaxation 
associated with E[A] is also not exact? This differs from the previous setup where exactness 
of the SDP relaxation holds, and this interesting peculiarity is one of the reasons why we 
have chosen to present this problem here. Our motivation stems from the observation that, 
in many situations, the adjacency matrix A? is only partially observed, but nevertheless, 
it might be interesting to find an approximating solution to the MAX-CUT problem. Let 
us then introduce a stochastic model for the partial information available on E[A], the 
adjacency matrix here. 

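We observe $A = S \circ A^0 = (s_{ij} A^0_{ij})_{1 \le i, j \le n}$, a “masked” version of $A^0$, where $S \in \mathbb{R}^{n \times n}$ is 
symmetric with upper triangular part filled with i.i.d. Bernoulli entries: for all $i, j \in V$ 
such that $i < j$, $S_{ij} = S_{ji} = s_{ij}$ where $(s_{ij})_{i<j}$ is a family of i.i.d. Bernoulli random variables 
with parameter $p \in (1/2, 1)$. Let $B := -(1/p)A$ so that $\mathbb{E}[B] = -A^0$. We can write $Z^*$ 
as an oracle since $Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}B, Z \rangle$ and so we estimate $Z^*$ via the SDP estimator 
$\hat{Z} \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle B, Z \rangle$. Our first aim is to quantify the cost we pay by using $\hat{Z}$ instead of 
$Z^*$ in our final choice of cut. It appears that the fixed point used in Theorem 1 may be 
used to quantify this loss 

$$r^*(\Delta) = \inf\left( r > 0 : \mathbb{P}\left[ \sup_{Z \in \mathcal{C} : \langle \mathbb{E}B, Z^* - Z \rangle \le r} \langle B - \mathbb{E}B, Z - Z^* \rangle \le (1/2) r \right] \ge 1 - \Delta \right). \tag{30}$$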
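
The masking model introduced above (observing $A = S \circ A^0$ and working with $B = -(1/p)A$) can be sketched as follows. The function names and the toy graph are assumptions made for this illustration.

```python
import random

def mask_adjacency(A0, p, rng):
    """Observed matrix A = S o A0, with one Bernoulli(p) mask per unordered pair."""
    n = len(A0)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = 1 if rng.random() < p else 0
            A[i][j] = A[j][i] = s * A0[i][j]
    return A

def rescaled_negative(A, p):
    """B = -(1/p) A, so that E[B] = -A0 when the masks have parameter p."""
    return [[-a / p for a in row] for row in A]

# Toy graph: a star with centre 0. With p = 0.75, about a quarter of the edges are hidden.
A0 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
A_masked = mask_adjacency(A0, 0.75, random.Random(0))
B = rescaled_negative(A_masked, 0.75)
```

Averaging `B` over many independent masks approaches $-A^0$, which is the unbiasedness property $\mathbb{E}[B] = -A^0$ that makes $Z^*$ an oracle for the estimator $\hat{Z}$.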














Our second result is an explicit high-probability upper bound on the latter fixed point. 


4.3.1 MAIN RESULTS FOR THE MAX-CUT PROBLEM 


In this section, we gather the two results on the estimation of $Z^*$ by $\hat{Z}$ and on the 
approximate optimality of the final cut constructed from $\hat{Z}$. Let us now explicitly provide 
the construction of this cut. We consider the same strategy as in Goemans and Williamson 
(1995). Assume that $\hat{Z}$ has been constructed. Let $G$ be a centered Gaussian vector with 
covariance matrix $\hat{Z}$. Let $\hat{x}$ be the sign vector of $G$. Using the statistical properties of $\hat{Z}$, 
it is possible to prove near optimality of $\hat{x}$. 
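
The rounding step can be sketched as follows, under the assumption that a factorization $\hat{Z}_{ij} = \langle v_i, v_j \rangle$ with vectors $v_i \in \mathbb{R}^d$ is available (for instance from an eigendecomposition, which is not shown here). If $g$ is a standard Gaussian vector in $\mathbb{R}^d$, then $(\langle v_i, g \rangle)_i$ is a centered Gaussian vector with covariance $\hat{Z}$. The function name `hyperplane_rounding` is hypothetical.

```python
import random

def hyperplane_rounding(vectors, rng):
    """Sign vector of G = (<v_i, g>)_i with g standard Gaussian in R^d.

    Cov(G_i, G_j) = <v_i, v_j>, so G is centered Gaussian with covariance equal to
    the Gram matrix of the input vectors.
    """
    d = len(vectors[0])
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    G = [sum(v[k] * g[k] for k in range(d)) for v in vectors]
    return [1 if c >= 0 else -1 for c in G]

# Rank-one toy case Z = x x^T with x = (1, -1, 1): here d = 1 and v_i = (x_i,).
rounded = hyperplane_rounding([[1.0], [-1.0], [1.0]], random.Random(0))
```

In the rank-one case the rounding returns $x$ or $-x$, which define the same cut; for a general covariance the output is random and its expected cut value is what is compared with $\mathrm{SDP}(G)$.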

We denote the optimal values of the MAX-CUT problem associated with the graph $G$ 
and its SDP relaxation by 

$$\mathrm{SDP}(G) := (1/4) \langle A^0, J - Z^* \rangle = \max_{Z \in \mathcal{C}} \frac{1}{4} \sum_{i,j=1}^n A^0_{ij}(1 - Z_{ij}) \quad \text{and} \quad \mathrm{MAXCUT}(G) := \mathrm{cut}(G, S^*),$$

where $S^*$ is a solution of (26) and $J = (1)_{n \times n}$. Our first result is to show how the 0.878-
approximation result from Goemans and Williamson (1995) is downgraded by the 
incomplete information we have on the graph (since we only partially observed the adjacency 
matrix $A^0$ via the masked matrix $A$). 


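Theorem 7 For all $0 < \Delta < 1$, with probability at least $1 - \Delta$ (with respect to the masks 
$S$), it holds true that 

$$\mathrm{SDP}(G) \ge \mathbb{E}\big[\mathrm{cut}(G, \hat x)\,|\,\hat Z\big] \ge 0.878\,\mathrm{SDP}(G) - \frac{0.878\, r^*(\Delta)}{4}.$$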

To make the notation more precise, $\hat x$ is the sign vector of $G$, which is a centered Gaussian 
variable with covariance $\hat Z$. In that context, $\mathbb{E}[\mathrm{cut}(G, \hat x)\,|\,\hat Z]$ is the conditional expectation 
according to $G$ for a fixed $\hat Z$. Moreover, the probability “at least $1 - \Delta$” that we obtain is 
w.r.t. the random masks, that is to the randomness in $A$. 

Let us now frame Theorem 7 into the following perspective. If we had known the entire 
adjacency matrix (which is the case when $p = 1$), then we could have used $Z^*$ instead of 
$\hat Z$. In that case, for $x^*$ the sign vector of $G^* \sim \mathcal{N}(0, Z^*)$, we know from Goemans and 
Williamson (1995) that 

$$\mathrm{SDP}(G) \ge \mathbb{E}\big[\mathrm{cut}(G, x^*)\big] \ge 0.878\,\mathrm{SDP}(G). \tag{31}$$

Therefore, from a trade-off perspective, Theorem 7 characterizes the price we pay for not 
observing the entire adjacency matrix $A^0$, but only a masked version $A$ of it. It is an 
interesting output of Theorem 7 to observe that the fixed point $r^*(\Delta)$ measures, in a 
quantitative way, this loss. If we were able to identify scenarios of $p$ and $E$ for which 
$r^*(\Delta) = 0$, that would prove that there is no loss for partially observing $A^0$ in the 
MAX-CUT problem. The approach we use to control $r^*(\Delta)$ is the global one, which does not 
allow for exact reconstruction (that is, to show that $r^*(\Delta) = 0$). 

Let us now turn to an estimation result of $Z^*$ by $\hat Z$ via an upper bound on $r^*(\Delta)$. 


Theorem 8 With probability at least $1 - 4^{-n}$: 

$$\langle \mathbb{E}B, Z^* - \hat Z\rangle \le r^*(4^{-n}) \le 2n\sqrt{\frac{(2\log 4)(1-p)(n-1)}{p}} + \frac{8n\log 4}{3}.$$

In particular, it follows from the approximation result from Theorem 7 and the 
high-probability upper bound on $r^*(\Delta)$ from Theorem 8 that, with probability at least $1 - 4^{-n}$, 

$$\mathbb{E}\big[\mathrm{cut}(G, \hat x)\,|\,\hat Z\big] \ge 0.878\,\mathrm{SDP}(G) - \frac{0.878}{4}\Big(2n\sqrt{\frac{(2\log 4)(1-p)(n-1)}{p}} + \frac{8n\log 4}{3}\Big). \tag{32}$$

This result is non-trivial only when the right-hand side term is strictly larger than $0.5 \cdot \mathrm{SDP}(G)$, 
which is the performance of a random cut. As a consequence, (32) shows that one can still 
do better than randomness even in an incomplete information setup for the MAX-CUT 
problem when $p$, $n$ and $\mathrm{SDP}(G)$ are such that 

$$0.378\,\mathrm{SDP}(G) > \frac{0.878}{4}\Big[2n\sqrt{\frac{(2\log 4)(1-p)(n-1)}{p}} + \frac{8n\log 4}{3}\Big].$$

For instance, when $p$ is like a constant, it requires $\mathrm{SDP}(G)$ to be larger than $c_0 n^{3/2}$ (for 
some absolute constant $c_0$) and when $p = 1 - 1/n$, it requires $\mathrm{SDP}(G)$ to be at least $c_0 n$ 
(for some absolute constant $c_0$). 
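
To get a feeling for the two regimes mentioned above, one can evaluate the bound numerically. The sketch below assumes the bound of Theorem 8 reads $2n\sqrt{(2\log 4)(1-p)(n-1)/p} + 8n\log 4/3$ and that the non-triviality condition is $0.378\,\mathrm{SDP}(G) > (0.878/4)\times$ this bound; the function names are hypothetical.

```python
import math


def theorem8_bound(n, p):
    """Evaluate 2n * sqrt((2 log 4)(1 - p)(n - 1) / p) + 8 n log(4) / 3."""
    root = math.sqrt(2 * math.log(4) * (1 - p) * (n - 1) / p)
    return 2 * n * root + 8 * n * math.log(4) / 3


def min_sdp_value(n, p):
    """Value that SDP(G) must exceed for the guarantee to beat a random cut."""
    return (0.878 / 4) * theorem8_bound(n, p) / 0.378


# Constant p: the threshold grows like n^{3/2}; p = 1 - 1/n: it grows like n.
for n in (100, 1000, 10000):
    print(n, min_sdp_value(n, 0.75) / n ** 1.5, min_sdp_value(n, 1 - 1 / n) / n)
```

The two printed ratios stay of constant order as $n$ grows, in line with the $c_0 n^{3/2}$ and $c_0 n$ requirements stated above.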


Remark 1 To get exact recovery, that is $r^*(\Delta) = 0$, in the MAX-CUT problem (which 
shows that there is no loss for the MAX-CUT problem by observing only a masked version 
of the adjacency matrix), we have to develop a local approach, as for the Signed Clustering 
and the Group Synchronization problems. To that end, we would need to solve the following 
two problems: 1) Find a curvature for the objective function $Z \mapsto \langle \mathbb{E}B, Z^* - Z\rangle$ and 2) 
Study the oscillations of the empirical process $Z \mapsto \langle \mathbb{E}B - B, Z^* - Z\rangle$. We leave those two 
difficult problems for future research. 


5. Proof of Theorem 3 (signed clustering) 


The aim of this paper is to put forward a methodology developed in Learning Theory for 
the study of SDP estimators. In each example, we follow this methodology. For a problem, 
such as the signed clustering, where it is possible to characterize the curvature of the excess 
risk, we start by identifying this curvature because the curvature function $G$, coming out of 
it, defines the local subsets of $\mathcal{C}$ driving the complexity of the problem. Then, we turn 
to the stochastic part of the proof, which is entirely summarized into the complexity fixed 
point $r^*_G(\Delta)$ from (5). Finally, we put the two pieces together and apply the main general 
result from Corollary 1 to obtain estimation results for the SDP estimator (11) in the signed 
clustering problem, which is summarized in Theorem 3. 


5.1 Curvature equation 











In this section, we show that the objective function $Z \in \mathcal{C} \mapsto \langle Z, \mathbb{E}A - \alpha J\rangle$ satisfies a 
curvature assumption around its maximizer $Z^*$ with respect to the $\ell_1^{n\times n}$-norm given by 
$G(Z^* - Z) = \theta \|Z^* - Z\|_1$, with parameter $\theta = \delta(p - q)$ (and margin exponent $\kappa = 1$). 


Proposition 1 For $\theta = \delta(p - q)$, we have for all $Z \in \mathcal{C}$, $\langle \mathbb{E}A - \alpha J, Z^* - Z\rangle = \theta \|Z^* - Z\|_1$. 












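Proof Let $Z$ be in $\mathcal{C}$. We have 

$$\langle Z^* - Z, \mathbb{E}A - \alpha J\rangle = \sum_{i,j=1}^n (Z^* - Z)_{ij}(\mathbb{E}A_{ij} - \alpha) = \delta(p - q)\Big[\sum_{(i,j) \in \mathcal{C}^+} (Z^* - Z)_{ij} - \sum_{(i,j) \in \mathcal{C}^-} (Z^* - Z)_{ij}\Big].$$

Moreover, for all $(i,j) \in \mathcal{C}^+$, $Z^*_{ij} = 1$ and $0 \le Z_{ij} \le 1$, so $(Z^* - Z)_{ij} = |(Z^* - Z)_{ij}|$. We also 
have for all $(i,j) \in \mathcal{C}^-$, $(Z^* - Z)_{ij} = -Z_{ij} = -|(Z^* - Z)_{ij}|$ because in that case $Z^*_{ij} = 0$ and 
$0 \le Z_{ij} \le 1$. Hence, 

$$\langle Z^* - Z, \mathbb{E}A - \alpha J\rangle = \delta(p - q)\Big[\sum_{(i,j) \in \mathcal{C}^+} |(Z^* - Z)_{ij}| + \sum_{(i,j) \in \mathcal{C}^-} |(Z^* - Z)_{ij}|\Big] = \theta \|Z^* - Z\|_1. \qquad \blacksquare$$
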

5.2 Computation of the complexity fixed point $r^*_G(\Delta)$ 


Define $W := A - \mathbb{E}A$ the noise matrix of the problem. Since $W$ is symmetric, its entries 
are not independent. In order to work only with independent random variables, we define 
the following matrix $Y \in \mathbb{R}^{n\times n}$: 

$$Y_{ij} := \begin{cases} W_{ij} & \text{if } i < j \\ 0 & \text{otherwise,} \end{cases} \tag{33}$$

where 0 entries are considered as independent Bernoulli variables with parameter 0 and 
therefore, $Y$ has independent entries, and satisfies the relation $W = Y + Y^\top$. 
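
As a small sanity check of this construction, the sketch below builds $Y$ from a symmetric matrix and verifies $W = Y + Y^\top$. It assumes that $W$ has a zero diagonal, which is what the relation requires; the helper names and the numerical values are purely illustrative.

```python
def strict_upper(w):
    """Y_ij = W_ij if i < j and 0 otherwise."""
    n = len(w)
    return [[w[i][j] if i < j else 0.0 for j in range(n)] for i in range(n)]


def plus_transpose(y):
    """Return Y + Y^T."""
    n = len(y)
    return [[y[i][j] + y[j][i] for j in range(n)] for i in range(n)]


# Hypothetical symmetric noise matrix with zero diagonal.
w = [[0.0, 0.5, -1.0],
     [0.5, 0.0, 2.0],
     [-1.0, 2.0, 0.0]]
y = strict_upper(w)
w_rebuilt = plus_transpose(y)  # equals w entry by entry
```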

In order to obtain upper bounds on the fixed point complexity parameter $r^*_G(\Delta)$ associated 
with the signed clustering problem, we need to prove a high-probability upper bound 
on the quantity 

$$\sup_{Z \in \mathcal{C} : \|Z - Z^*\|_1 \le r} \langle W, Z - Z^*\rangle, \tag{34}$$

and then find a radius $r$ as small as possible such that the quantity in (34) is less than 
$(\theta/2)r$. We denote $\mathcal{C}_r := \mathcal{C} \cap (Z^* + rB_1^{n\times n}) = \{Z \in \mathcal{C} : \|Z - Z^*\|_1 \le r\}$ where $B_1^{n\times n}$ is the 
unit $\ell_1^{n\times n}$-ball of $\mathbb{R}^{n\times n}$. 

We follow the strategy from Fei and Chen (2019b) by decomposing the inner product 
$\langle W, Z - Z^*\rangle$ into two parts according to the SVD of $Z^*$. This observation is a key point 
in the work of Fei and Chen (2019b) compared to the analysis from Guédon and Vershynin 
(2016). This allows one to perform the localization argument efficiently. Up to a change of 
index of the nodes, $Z^*$ is a block matrix with $K$ diagonal blocks of 1’s. It therefore admits 
$K$ singular vectors $U_{\bullet k} := I(i \in \mathcal{C}_k)/\sqrt{|\mathcal{C}_k|}$ with multiplicity $l_k$ associated with the singular 
value $l_k$ for all $k \in [K]$. We can therefore write 

$$Z^* = \sum_{k=1}^K l_k U_{\bullet k} \otimes U_{\bullet k} = UDU^\top,$$

where $U \in \mathbb{R}^{n\times K}$ has $K$ column vectors given by $U_{\bullet k}$, $k = 1,\ldots,K$ and $D = \mathrm{diag}(l_1,\ldots,l_K)$. 
We define the following projection operator 

$$P : M \in \mathbb{R}^{n\times n} \mapsto UU^\top M + MUU^\top - UU^\top MUU^\top,$$

and its orthogonal projection $P^\perp$ by 

$$P^\perp : M \in \mathbb{R}^{n\times n} \mapsto M - P(M) = (I_n - UU^\top)M(I_n - UU^\top) = \sum_{k,l=K+1}^n \langle M, U_{\bullet k} \otimes U_{\bullet l}\rangle U_{\bullet k} \otimes U_{\bullet l},$$

where $U_{\bullet k} \in \mathbb{R}^n$, $k = K+1,\ldots,n$ are such that $(U_{\bullet k} : k = 1,\ldots,n)$ is an orthonormal basis 
of $\mathbb{R}^n$. 
We use the same decomposition as in Fei and Chen (2019b): for all $Z \in \mathcal{C}$, 

$$\langle W, Z - Z^*\rangle = \langle W, P(Z - Z^*) + P^\perp(Z - Z^*)\rangle = \underbrace{\langle P(Z - Z^*), W\rangle}_{S_1(Z)} + \underbrace{\langle P^\perp(Z - Z^*), W\rangle}_{S_2(Z)}.$$

The next step is to control with large probability the two terms $S_1(Z)$ and $S_2(Z)$ uniformly 
for all $Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})$. To that end, we use the two following propositions where we 
recall that $\rho = \delta \max(1 - \delta(2p - 1)^2, 1 - \delta(2q - 1)^2)$ and $\nu = \max(2p - 1, 1 - 2q)$. The proofs 
of Propositions 2 and 3 can be found in Section B; they are based on Fei and Chen (2019b). 





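Proposition 2 There are absolute positive constants $c_0$, $c_1$, $c_2$ and $c_3$ such that the following 
holds. If $\lceil c_1 rK/n\rceil \ge 2eKn\exp(-(9/32)n\rho/K)$ then we have 

$$\mathbb{P}\Big[\sup_{Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})} S_1(Z) \le c_2 r\sqrt{\frac{K\rho}{n}\log\Big(\frac{2eKn}{\lceil c_1 rK/n\rceil}\Big)}\Big] \ge 1 - 3\Big(\frac{\lceil c_1 rK/n\rceil}{2eKn}\Big)^{\lceil c_1 rK/n\rceil}.$$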










Proposition 3 There exists an absolute constant $c_0 > 0$ such that the following holds. 
When $n\nu\delta \ge \log n$, with probability at least $1 - \exp(-\delta\nu n)$, 

$$\sup_{Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})} S_2(Z) \le c_0 K r\sqrt{\frac{\delta\nu}{n}}.$$

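It follows from Proposition 2 and Proposition 3 that when $n\nu\delta \ge \log n$, for all $r$ such that 
$\lceil c_1 rK/n\rceil \ge 2eKn\exp(-(9/32)n\rho/K)$ we have, for 

$$\Delta = \Delta(r) := \exp(-\delta\nu n) + 3\Big(\frac{\lceil c_1 rK/n\rceil}{2eKn}\Big)^{\lceil c_1 rK/n\rceil},$$

with probability at least $1 - \Delta$, 

$$\sup_{Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})} \langle W, Z - Z^*\rangle \le c_0 K r\sqrt{\frac{\delta\nu}{n}} + c_2 r\sqrt{\frac{K\rho}{n}\log\Big(\frac{2eKn}{\lceil c_1 rK/n\rceil}\Big)}.$$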

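Moreover, we have 

$$c_0 K r\sqrt{\frac{\delta\nu}{n}} + c_2 r\sqrt{\frac{K\rho}{n}\log\Big(\frac{2eKn}{\lceil c_1 rK/n\rceil}\Big)} \le \frac{\theta}{2} r \tag{35}$$

for $\theta = \delta(p - q)$ when $K\sqrt{\nu} \lesssim \sqrt{n\delta}(p - q)$ and $\lceil c_1 rK/n\rceil \ge 2eKn\exp(-\theta^2 n/(K\rho))$. In 
particular, when $(p - q)^2 n\delta \gtrsim K^2\nu$ and $1 \ge 2eKn \max\big(\exp(-\theta^2 n/(K\rho)), \exp(-(9/32)n\rho/K)\big)$, 
we conclude that for all $0 < r \le n/(c_1 K)$ (35) is true. Therefore, one can take $r^*_G(\Delta(0)) = 0$, 
meaning that we have exact reconstruction of $Z^*$: if $(p - q)^2 n\delta \gtrsim K^2\nu$ and $n \gtrsim 
K \max\big((\rho/\theta^2)\log(2eK^2\rho/\theta^2), (1/\rho)\log(2eK^2/\rho)\big)$ then with probability at least $1 - \exp(-\delta\nu n) - 
3/(2eKn)$, $\hat Z = Z^*$. 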

If $(p - q)^2 n\delta \gtrsim K^2\nu$ and $1 < 2eKn \max\big(\exp(-\theta^2 n/(K\rho)), \exp(-(9/32)n\rho/K)\big)$ then we 
do not have exact reconstruction anymore, but we see that (35) is true for any $r$ such that 
$2eKn\exp(-\theta^2 n/(16c_2^2 K\rho)) \le \lceil c_1 rK/n\rceil \le 2eKn\exp(-c_0^2 K\delta\nu/(c_2^2\rho))$, which is possible since 
$(p - q)^2 n\delta \gtrsim K^2\nu$, and then we can conclude that $r^*_G(\Delta) \le 2en^2\exp(-\theta^2 n/(16c_2^2 K\rho))$. 

Therefore, it follows from Corollary 1 that 

$$\|Z^* - \hat Z\|_1 \le \frac{2en^2}{c_1\theta}\exp\Big(-\frac{\theta^2 n}{16c_2^2 K\rho}\Big).$$

















6. Proofs of Theorem 4 and Corollary 2 (angular synchronization) 
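Proof of (17): We recall that the offsets are $\delta_{ij} = \theta_i - \theta_j\ [2\pi]$ and we will use that if $g$ 
is $\mathcal{N}(0,1)$ then $\mathbb{E}e^{\iota\sigma g} = e^{-\sigma^2/2}$. For all $\gamma_1,\ldots,\gamma_n \in [0,2\pi)$, we have $\gamma_i - \gamma_j = \delta_{ij}$ for all 
$i \ne j \in [n]$ if and only if $e^{\sigma^2/2}\mathbb{E}A_{ij}e^{\iota\gamma_j} - e^{\iota\gamma_i} = 0$ for all $i \ne j \in [n]$. We therefore have 

$$\mathrm{argmin}_{x \in \mathbb{C}^n : |x_i| = 1}\Big\{\sum_{i \ne j} |e^{\sigma^2/2}\mathbb{E}A_{ij}x_j - x_i|^2\Big\} = \big\{(e^{\iota(\theta_i + \theta_0)})_{i=1}^n : \theta_0 \in [0,2\pi)\big\}. \tag{36}$$

Moreover, for all $x = (x_i)_{i=1}^n \in \mathbb{C}^n$ such that $|x_i| = 1$ for $i = 1,\ldots,n$, we have 

$$\sum_{i \ne j} |e^{\sigma^2/2}\mathbb{E}A_{ij}x_j - x_i|^2 = \sum_{i,j=1}^n |x^*_i\bar x^*_j x_j - x_i|^2 = \sum_{i,j=1}^n |\bar x^*_j x_j - \bar x^*_i x_i|^2 = 2n^2 - 2\Re\Big(\sum_{i,j=1}^n \bar x^*_j x_j x^*_i \bar x_i\Big) = 2n^2 - 2|\langle x^*, x\rangle|^2,$$

where $\Re(z)$ denotes the real part of $z \in \mathbb{C}$. On the other side, we have 

$$\bar x^\top (e^{\sigma^2/2}\mathbb{E}A)x = \sum_{i \ne j} \bar x_i x^*_i \bar x^*_j x_j + \sum_{i=1}^n e^{\sigma^2/2}|x_i|^2 = n(e^{\sigma^2/2} - 1) + |\langle x^*, x\rangle|^2.$$

Hence, minimizing $x \mapsto \sum_{i \ne j} |e^{\sigma^2/2}\mathbb{E}A_{ij}x_j - x_i|^2$ over all $x = (x_j)_j \in \mathbb{C}^n$ such that $|x_j| = 1$ 
is equivalent to maximizing $x \mapsto \bar x^\top \mathbb{E}A x$ over all $x = (x_j)_j \in \mathbb{C}^n$ such that $|x_j| = 1$. This 
concludes the proof with (36). $\blacksquare$ 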
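
The algebraic identity used in this proof, $\sum_{i \ne j} |x^*_i\bar x^*_j x_j - x_i|^2 = 2n^2 - 2|\langle x^*, x\rangle|^2$ for unit-modulus entries, is easy to check numerically. The sketch below is only such a check; the function name `sync_objectives` and the random test vectors are hypothetical, and it uses that $e^{\sigma^2/2}\mathbb{E}A_{ij} = x^*_i\bar x^*_j$ for $i \ne j$.

```python
import cmath
import math
import random


def sync_objectives(x_star, x):
    """Return (lhs, rhs) of the identity for unit-modulus complex vectors.

    lhs = sum_{i != j} |x*_i conj(x*_j) x_j - x_i|^2
    rhs = 2 n^2 - 2 |<x*, x>|^2
    """
    n = len(x)
    lhs = sum(abs(x_star[i] * x_star[j].conjugate() * x[j] - x[i]) ** 2
              for i in range(n) for j in range(n) if i != j)
    inner = sum(xs.conjugate() * xi for xs, xi in zip(x_star, x))
    rhs = 2 * n ** 2 - 2 * abs(inner) ** 2
    return lhs, rhs


rng = random.Random(0)
n = 6
x_star = [cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for _ in range(n)]
x_rand = [cmath.exp(1j * rng.uniform(0, 2 * math.pi)) for _ in range(n)]
x_shift = [cmath.exp(0.3j) * v for v in x_star]  # global phase shift of x*

lhs_rand, rhs_rand = sync_objectives(x_star, x_rand)
lhs_shift, rhs_shift = sync_objectives(x_star, x_shift)
```

For a global phase shift of $x^*$ both sides vanish up to rounding, which is the content of (36).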


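Proof of (18): Let $\mathcal{C}' = \{Z \in \mathbb{C}^{n\times n} : |Z_{ij}| \le 1, \forall i,j \in [n]\}$. We first prove that $\mathcal{C} \subset \mathcal{C}'$. 
Let $Z \in \mathcal{C}$. Since $Z \succeq 0$, there exists $X \in \mathbb{C}^{n\times n}$ such that $Z = X\bar X^\top$. For all $i \in \{1,\ldots,n\}$, 
denote by $X_{i\bullet}$ the $i$-th row vector of $X$. We have $\|X_{i\bullet}\|_2^2 = \langle X_{i\bullet}, X_{i\bullet}\rangle = Z_{ii} = 1$ since 
$\mathrm{diag}(Z) = 1_n$. Moreover, for all $i,j \in [n]$, we have $|Z_{ij}| = |\langle X_{i\bullet}, X_{j\bullet}\rangle| \le \|X_{i\bullet}\|_2 \|X_{j\bullet}\|_2 \le 1$. 
This proves that $Z \in \mathcal{C}'$ and so $\mathcal{C} \subset \mathcal{C}'$. 

Let $Z' \in \mathrm{argmax}\big(\Re(\langle \mathbb{E}A, Z\rangle) : Z \in \mathcal{C}'\big)$. Since $\mathcal{C}'$ is convex and the objective function 
$Z \mapsto \Re(\langle \mathbb{E}A, Z\rangle)$ is linear (for real coefficients), $Z'$ is one of the extreme points of $\mathcal{C}'$. 
Extreme points of $\mathcal{C}'$ are matrices $Z \in \mathbb{C}^{n\times n}$ such that $|Z_{ij}| = 1$ for all $i,j \in [n]$. We can 
then write each entry of $Z'$ as $Z'_{ij} = e^{\iota\beta_{ij}}$ for some $0 \le \beta_{ij} < 2\pi$ and now we obtain 

$$\begin{aligned} \Re(\langle \mathbb{E}A, Z'\rangle) &= \Re\Big(\sum_{i,j=1}^n \mathbb{E}A_{ij}\bar Z'_{ij}\Big) = \Re\Big(\sum_{i \ne j} e^{-\sigma^2/2}e^{\iota(\delta_{ij} - \beta_{ij})}\Big) + \Re\Big(\sum_{i=1}^n e^{\iota(\delta_{ii} - \beta_{ii})}\Big) \\ &= \sum_{i \ne j} e^{-\sigma^2/2}\cos(\delta_{ij} - \beta_{ij}) + \sum_{i=1}^n \cos(\delta_{ii} - \beta_{ii}) \le e^{-\sigma^2/2}(n^2 - n) + n. \end{aligned}$$

The maximal value $e^{-\sigma^2/2}(n^2 - n) + n$ is attained only for $\beta_{ij} = \delta_{ij}$ for all $i,j \in [n]$, that is 
for $Z' = (e^{\iota\delta_{ij}})_{i,j=1,\ldots,n} = Z^*$. But we have $Z^* \in \mathcal{C}$ and $\mathcal{C} \subset \mathcal{C}'$, so $Z^*$ is the only maximizer 
of $Z \mapsto \Re(\langle \mathbb{E}A, Z\rangle)$ on $\mathcal{C}$. But for all $Z \in \mathcal{C}$ we have $\langle \mathbb{E}A, Z\rangle = e^{-\sigma^2/2}\bar x^{*\top} Z x^* + (1 - e^{-\sigma^2/2})n \in \mathbb{R}$, 
so $Z^*$ is the only maximizer of $Z \mapsto \langle \mathbb{E}A, Z\rangle$ over $\mathcal{C}$. $\blacksquare$ 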

6.1 Curvature of the objective function 














Proposition 4 For $\theta = e^{-\sigma^2/2}/2$, we have for all $Z \in \mathcal{C}$, $\langle \mathbb{E}A, Z^* - Z\rangle \ge \theta \|Z^* - Z\|_2^2$. 


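Proof Let $Z = (z_{ij}e^{\iota\beta_{ij}})_{i,j=1}^n \in \mathcal{C}$ where $z_{ij} \in \mathbb{R}$ and $0 \le \beta_{ij} < 2\pi$ for all $i,j \in [n]$. Since 
$Z^*_{ii} = Z_{ii} = 1$ for all $i \in [n]$, we have, on one side, $\langle \mathbb{E}A, Z^* - Z\rangle = e^{-\sigma^2/2}\bar x^{*\top}(Z^* - Z)x^* \in \mathbb{R}$, 
and so 

$$\begin{aligned} \langle \mathbb{E}A, Z^* - Z\rangle &= \Re(\langle \mathbb{E}A, Z^* - Z\rangle) = \Re\Big(\sum_{i,j=1}^n \mathbb{E}A_{ij}\overline{(Z^* - Z)_{ij}}\Big) \\ &= \Re\Big(\sum_{i,j=1}^n e^{-\sigma^2/2}e^{\iota\delta_{ij}}\big(e^{-\iota\delta_{ij}} - z_{ij}e^{-\iota\beta_{ij}}\big)\Big) \\ &= e^{-\sigma^2/2}\,\Re\Big(\sum_{i,j=1}^n \big(1 - z_{ij}e^{\iota(\delta_{ij} - \beta_{ij})}\big)\Big) = e^{-\sigma^2/2}\sum_{i,j=1}^n \big(1 - z_{ij}\cos(\delta_{ij} - \beta_{ij})\big). \end{aligned} \tag{37}$$

On the other side, we showed in the proof of (18) that $\mathcal{C} \subset \{Z \in \mathbb{C}^{n\times n} : |Z_{ij}| \le 1, \forall i,j \in [n]\}$. 
So we have $|z_{ij}| \le 1$ for all $i,j \in [n]$ and 

$$\begin{aligned} \|Z^* - Z\|_2^2 &= \sum_{i,j=1}^n |(Z^* - Z)_{ij}|^2 = \sum_{i,j=1}^n |e^{\iota\delta_{ij}} - z_{ij}e^{\iota\beta_{ij}}|^2 = \sum_{i,j=1}^n |1 - z_{ij}e^{\iota(\beta_{ij} - \delta_{ij})}|^2 \\ &= \sum_{i,j=1}^n \big[(1 - z_{ij}\cos(\beta_{ij} - \delta_{ij}))^2 + z_{ij}^2\sin^2(\beta_{ij} - \delta_{ij})\big] = \sum_{i,j=1}^n \big[1 - 2z_{ij}\cos(\beta_{ij} - \delta_{ij}) + z_{ij}^2\big] \\ &\le 2\sum_{i,j=1}^n \big(1 - z_{ij}\cos(\beta_{ij} - \delta_{ij})\big). \end{aligned} \tag{38}$$

We conclude with (37) and (38). $\blacksquare$ 
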

In fact, it follows from the proof of Proposition 4 that we have the following equality: for 
all $Z \in \mathcal{C}$, 

$$\langle \mathbb{E}A, Z^* - Z\rangle = \theta\big(\|Z^* - Z\|_2^2 + \||Z^*|^2 - |Z|^2\|_1\big),$$

where $|Z|^2 = (|Z_{ij}|^2)_{1 \le i,j \le n}$ (in particular, $|Z^*|^2 = (1)_{n\times n}$). We therefore know exactly how 
to characterize the curvature of the excess risk for the angular synchronization problem in 
terms of the $\ell_2$ (to the square) and the $\ell_1$ norms. Nevertheless, we will not use the extra 
term $\||Z^*|^2 - |Z|^2\|_1$ in the following. 


6.2 Computation of the complexity fixed point $r^*_G(\Delta)$ 


It follows from the (global) curvature property of the excess risk for the angular 
synchronization problem obtained in Proposition 4 that for the curvature function $G$ defined by 
$G(Z^* - Z) = \theta \|Z^* - Z\|_2^2, \forall Z \in \mathcal{C}$, we just have to compute the $r^*_G(\Delta)$ fixed point and then 
apply Corollary 1 in order to obtain statistical properties of $\hat Z$ (w.r.t. both the excess 
risk and the $G$ function). In this section, we compute the complexity fixed point $r^*_G(\Delta)$ for 
$0 < \Delta < 1$. 

Following Proposition 4, the natural “local” subsets of $\mathcal{C}$ around $Z^*$ which drive the 
statistical complexity of the synchronization problem are defined for all $r > 0$ by $\mathcal{C}_r = \{Z \in 
\mathcal{C} : \|Z - Z^*\|_2 \le r\} = \mathcal{C} \cap (Z^* + rB_2^{n\times n})$. 

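Let $Z \in \mathcal{C}_r$. Denote by $b^R_{ij}$ (resp. $b^I_{ij}$) the real (resp. imaginary) part of $b_{ij} = Z^*_{ij}\bar Z_{ij} - Z^*_{ij}\bar Z^*_{ij}$ 
for all $i,j \in [n]$. Since $\|Z - Z^*\|_2 \le r$ we also have $\sum_{i,j} (b^R_{ij})^2 + (b^I_{ij})^2 \le r^2$ and so 

$$\begin{aligned} \langle A - \mathbb{E}A, Z - Z^*\rangle &= \langle (S - \mathbb{E}S) \circ Z^*, Z - Z^*\rangle = 2\Re\Big(\sum_{i<j} (S_{ij} - \mathbb{E}S_{ij})b_{ij}\Big) \\ &= 2\sum_{i<j} \big[(\cos(\sigma g_{ij}) - \mathbb{E}\cos(\sigma g_{ij}))b^R_{ij} - \sin(\sigma g_{ij})b^I_{ij}\big] \\ &\le 2r\sqrt{\sum_{i<j} (\cos(\sigma g_{ij}) - \mathbb{E}\cos(\sigma g_{ij}))^2 + (\sin(\sigma g_{ij}))^2} \\ &\le 2r\sqrt{1 - e^{-\sigma^2} + 2e^{-\sigma^2/2}\sum_{i<j} \big(\mathbb{E}\cos(\sigma g_{ij}) - \cos(\sigma g_{ij})\big)} \end{aligned}$$
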
where we used that $\mathbb{E}\cos(\sigma g) = \Re(\mathbb{E}e^{\iota\sigma g}) = e^{-\sigma^2/2}$ for $g \sim \mathcal{N}(0,1)$. Now it remains to get a 
high-probability upper bound on the sum of the centered cosines of $\sigma g_{ij}$. We use Bernstein’s 
inequality (see Equation (41) below) to get such a bound. For all $t > 0$, with probability at 
least $1 - \exp(-t)$, 

$$\frac{1}{\sqrt{N}}\sum_{i<j} \big(\mathbb{E}\cos(\sigma g_{ij}) - \cos(\sigma g_{ij})\big) \le \sqrt{2Vt} + \frac{2t}{3\sqrt{N}} \le (1 - e^{-\sigma^2})\sqrt{t} + \frac{2t}{3\sqrt{N}},$$

for $N = n(n - 1)/2$ and $V = \mathbb{E}\cos^2(\sigma g) - (\mathbb{E}\cos(\sigma g))^2 = (1/2)(1 - e^{-\sigma^2})^2$ (because 
$\mathbb{E}\cos^2(\sigma g) = (1/2)\mathbb{E}(1 + \cos(2\sigma g)) = (1/2)(1 + e^{-2\sigma^2})$ when $g \sim \mathcal{N}(0,1)$). 
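
The two moment formulas for $\cos(\sigma g)$ used here can be checked with a short simulation. This is a sanity-check sketch only; the function names, the sample size and the seed are arbitrary choices made for illustration.

```python
import math
import random


def cos_moments(sigma):
    """Closed forms for g ~ N(0, 1): mean and variance of cos(sigma * g)."""
    mean = math.exp(-sigma ** 2 / 2)
    second_moment = 0.5 * (1 + math.exp(-2 * sigma ** 2))
    return mean, second_moment - mean ** 2


def cos_moments_monte_carlo(sigma, n_samples=100000, seed=0):
    """Empirical mean and variance of cos(sigma * g) over n_samples draws."""
    rng = random.Random(seed)
    values = [math.cos(sigma * rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = sum(values) / n_samples
    var = sum((v - mean) ** 2 for v in values) / n_samples
    return mean, var


sigma = 0.8
exact_mean, exact_var = cos_moments(sigma)
mc_mean, mc_var = cos_moments_monte_carlo(sigma)
```

The exact variance agrees with $(1/2)(1 - e^{-\sigma^2})^2$, which is the value of $V$ plugged into Bernstein's inequality above.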

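We now have all the ingredients to compute the fixed point $r^*_G(\Delta)$ for $0 < \Delta < 1$: for 
$\theta = e^{-\sigma^2/2}/2$ and $t = \log(1/\Delta)$, 

$$r^*_G(\Delta) \le \frac{4}{\theta}\Big(1 - e^{-\sigma^2} + 2e^{-\sigma^2/2}\Big((1 - e^{-\sigma^2})\sqrt{tN} + \frac{2t}{3}\Big)\Big) = \frac{32t}{3} + 8(1 - e^{-\sigma^2})\big(e^{\sigma^2/2} + 2\sqrt{tN}\big).$$

In particular, using $1 - e^{-\sigma^2} \le \sigma^2$ and for $t = \epsilon\sigma^4 N$ (where $N = n(n - 1)/2$) for some 
$0 < \epsilon < 1$, if $e^{\sigma^2/2} \le 2\sigma^2\sqrt{\epsilon}N$ then $r^*_G(\Delta) \le (128/3)\sigma^4 N\sqrt{\epsilon}$. 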
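
The chain of inequalities leading to $(128/3)\sigma^4 N\sqrt{\epsilon}$ can be followed numerically. The sketch below assumes the fixed-point bound has the form $(4/\theta)\big(1 - e^{-\sigma^2} + 2e^{-\sigma^2/2}((1 - e^{-\sigma^2})\sqrt{tN} + 2t/3)\big)$ with $\theta = e^{-\sigma^2/2}/2$; the function names and the numerical values are illustrative only.

```python
import math


def fixed_point_bound(sigma, n, t):
    """Return both expressions of the upper bound on the fixed point."""
    big_n = n * (n - 1) / 2
    theta = math.exp(-sigma ** 2 / 2) / 2
    a = 1 - math.exp(-sigma ** 2)
    first = (4 / theta) * (a + 2 * math.exp(-sigma ** 2 / 2)
                           * (a * math.sqrt(t * big_n) + 2 * t / 3))
    second = 32 * t / 3 + 8 * a * (math.exp(sigma ** 2 / 2)
                                   + 2 * math.sqrt(t * big_n))
    return first, second


def simplified_bound(sigma, n, eps):
    """(128/3) * sigma^4 * N * sqrt(eps), meant for t = eps * sigma^4 * N."""
    big_n = n * (n - 1) / 2
    return (128 / 3) * sigma ** 4 * big_n * math.sqrt(eps)


sigma, n, eps = 1.0, 100, 0.25
big_n = n * (n - 1) / 2
t = eps * sigma ** 4 * big_n
first, second = fixed_point_bound(sigma, n, t)
condition = math.exp(sigma ** 2 / 2) <= 2 * sigma ** 2 * math.sqrt(eps) * big_n
simple = simplified_bound(sigma, n, eps)
```

With these values the condition holds and the explicit bound is below the simplified one, as the text claims.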


6.3 End of the proof of Theorem 4 and Corollary 2: application of Corollary 1 


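Take $\Delta = \exp(-\epsilon\sigma^4 N)$ (for $N = n(n - 1)/2$); we have $r^*_G(\Delta) \le (128/3)\sqrt{\epsilon}\sigma^4 N$ when 
$e^{\sigma^2/2} \le 2\sqrt{\epsilon}\sigma^2 N$ (which holds for instance when $\sigma \le \sqrt{\log(\epsilon N^2)}$) and so it follows from 
Corollary 1 (together with the curvature property in Section 6.1 and the computation of the 
fixed point $r^*_G(\Delta)$ from Section 6.2), that with probability at least $1 - \exp(-\epsilon\sigma^4 n(n - 1)/2)$, 
$\theta \|Z^* - \hat Z\|_2^2 \le \langle \mathbb{E}A, Z^* - \hat Z\rangle \le (128/3)\sqrt{\epsilon}\sigma^4 N$, which is the statement of Theorem 4. 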

Proof of Corollary 2: The oracle $Z^*$ is the rank one matrix $x^*\bar x^{*\top}$ which has $n$ for 
largest eigenvalue and associated eigenspace $\{\lambda x^* : \lambda \in \mathbb{C}\}$. In particular, $Z^*$ has a spectral 
gap $g = n$. Let $\hat x \in \mathbb{C}^n$ be a top eigenvector of $\hat Z$ with norm $\|\hat x\|_2 = \sqrt{n}$. It follows from the 
Davis-Kahan Theorem (see, for example, Theorem 4.5.5 in Vershynin (2018) or Theorem 4 
in Vu (2010)) that there exists a universal constant $c_0 > 0$ such that 
where $g = n$ is the spectral gap of $Z^*$. We conclude the proof of Corollary 2 using the 
upper bound on $\|\hat Z - Z^*\|_2$ from Theorem 4. $\blacksquare$ 


7. Proofs of Theorems 7 and 8 (MAX-CUT) 


In this section, we prove the two main results from Section 4.3 using our general methodology 
for Theorem 8 and the technique from Goemans and Williamson (1995) for Theorem 7. 


7.1 Proof of Theorem 7 


The proof of Theorem 7 follows the one from Goemans and Williamson (1995) up to a 
minor modification due to the fact that we use the SDP estimator $\hat Z$ instead of the oracle 
$Z^*$. It is based on two tools. The first one is Grothendieck’s identity: for $g \sim \mathcal{N}(0, I_n)$ and 
$u, v \in S_2^{n-1}$, we have 

$$\mathbb{E}\big[\mathrm{sign}(\langle g, u\rangle)\,\mathrm{sign}(\langle g, v\rangle)\big] = \frac{2}{\pi}\arcsin(\langle u, v\rangle), \tag{39}$$

and the identity: for all $t \in [-1, 1]$, 

$$1 - \frac{2}{\pi}\arcsin(t) = \frac{2}{\pi}\arccos(t) \ge 0.878(1 - t). \tag{40}$$
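
The constant 0.878 in (40) can be recovered numerically by minimizing the ratio $(2/\pi)\arccos(t)/(1 - t)$ over a fine grid of $[-1, 1)$. The grid size below is an arbitrary choice and the function name is hypothetical.

```python
import math


def gw_ratio(t):
    """Ratio (2/pi) * arccos(t) / (1 - t), defined for t in [-1, 1)."""
    return (2 / math.pi) * math.acos(t) / (1 - t)


# Grid over [-1, 1); the point t = 1 is excluded since the ratio is 0/0 there.
grid = [-1 + 2 * k / 200000 for k in range(200000)]
alpha = min(gw_ratio(t) for t in grid)
t_min = min(grid, key=gw_ratio)

# The equality part of (40): 1 - (2/pi) arcsin(t) = (2/pi) arccos(t).
gap = max(abs(1 - (2 / math.pi) * math.asin(t) - (2 / math.pi) * math.acos(t))
          for t in grid[::1000])
```

The minimum is attained near $t \approx -0.69$ and is slightly larger than 0.878, which is why the inequality in (40) holds with that constant.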


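We now have enough tools to prove Theorem 7. The right-hand side inequality is trivial 
since $\mathrm{MAXCUT}(G) \le \mathrm{SDP}(G)$. For the left-hand side, we denote by $X_1,\ldots,X_n$ (resp. 
$X^*_1,\ldots,X^*_n$) the $n$ column vectors in $S_2^{n-1}$ of $\hat Z^{1/2}$ (resp. $(Z^*)^{1/2}$). We also consider the event $\Omega^*$ 
onto which 

$$\langle \mathbb{E}B, Z^* - \hat Z\rangle \le r^*(\Delta),$$

which holds with probability at least $1 - \Delta$ according to Theorem 1. On the event $\Omega^*$, we 
have 

$$\begin{aligned} \mathbb{E}\big[\mathrm{cut}(G, \hat x)\,|\,\hat Z\big] &= \mathbb{E}\Big[\frac{1}{4}\sum_{i,j=1}^n A^0_{ij}(1 - \hat x_i\hat x_j)\,\Big|\,\hat Z\Big] = \frac{1}{4}\sum_{i,j=1}^n A^0_{ij}\big(1 - \mathbb{E}[\mathrm{sign}(\langle X_i, g\rangle)\,\mathrm{sign}(\langle X_j, g\rangle)]\big) \\ &\overset{(i)}{=} \frac{1}{4}\sum_{i,j=1}^n A^0_{ij}\Big(1 - \frac{2}{\pi}\arcsin(\langle X_i, X_j\rangle)\Big) = \frac{1}{4}\sum_{i,j=1}^n A^0_{ij}\frac{2}{\pi}\arccos(\langle X_i, X_j\rangle) \\ &\overset{(ii)}{\ge} \frac{0.878}{4}\sum_{i,j=1}^n A^0_{ij}(1 - \langle X_i, X_j\rangle) = \frac{0.878}{4}\langle A^0, J - \hat Z\rangle \\ &= \frac{0.878}{4}\langle A^0, J - Z^*\rangle + \frac{0.878}{4}\langle A^0, Z^* - \hat Z\rangle = 0.878\,\mathrm{SDP}(G) - \frac{0.878}{4}\langle \mathbb{E}B, Z^* - \hat Z\rangle \\ &\ge 0.878\,\mathrm{SDP}(G) - \frac{0.878\, r^*(\Delta)}{4}, \end{aligned}$$

where we used (39) in (i) and (40) in (ii). 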


7.2 Proof of Theorem 8 


For the MAX-CUT problem, we do not use any localization argument; we therefore use 
the (likely sub-optimal) global approach. The methodology is very close to the one used in 
Guédon and Vershynin (2016) for the community detection problem. In particular, we use 
both Bernstein and Grothendieck inequalities to compute a high-probability upper bound for 
$r^*(\Delta)$. We recall these two tools now. First Bernstein’s inequality: if $Y_1,\ldots,Y_N$ are $N$ 
independent centered random variables such that $|Y_i| \le M$ a.s. for all $i = 1,\ldots,N$ then for 
all $t > 0$, with probability at least $1 - \exp(-t)$, 

$$\frac{1}{\sqrt{N}}\sum_{i=1}^N Y_i \le \sigma\sqrt{2t} + \frac{2Mt}{3\sqrt{N}}, \tag{41}$$

where $\sigma^2 = (1/N)\sum_{i=1}^N \mathrm{var}(Y_i)$. The second tool is Grothendieck inequality Grothendieck 
(1956) (see also Pisier (2012) or Theorem 3.4 in Guédon and Vershynin (2016)): if $C \in \mathbb{R}^{n\times n}$ 
then 

$$\sup_{Z \in \mathcal{C}} \langle C, Z\rangle \le K_G \|C\|_{cut} = K_G \max_{s,t \in \{-1,1\}^n} \sum_{i,j=1}^n C_{ij}s_i t_j \tag{42}$$

where $\mathcal{C} = \{Z \succeq 0 : Z_{ii} = 1, i = 1,\ldots,n\}$ and $K_G$ is an absolute constant, called the 
Grothendieck constant. 
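
For very small matrices, the quantity $\max_{s,t \in \{-1,1\}^n} \sum_{i,j} C_{ij}s_i t_j$ appearing in (42) can be computed by brute force, which may help to build intuition. The sketch below is only an illustration (exponential in $n$); the function name `cut_norm` and the example matrices are hypothetical.

```python
from itertools import product


def cut_norm(c):
    """Brute-force max over s, t in {-1, 1}^n of sum_{i,j} c[i][j] * s_i * t_j.

    For a fixed s the optimal t_j is the sign of sum_i c[i][j] * s_i, so the
    inner maximum equals the sum of the absolute values of these column sums.
    """
    n = len(c)
    best = float("-inf")
    for s in product((-1, 1), repeat=n):
        value = sum(abs(sum(c[i][j] * s[i] for i in range(n))) for j in range(n))
        best = max(best, value)
    return best


norm_a = cut_norm([[1, -1], [-1, 1]])  # attained at s = (1, -1), t = (1, -1)
norm_b = cut_norm([[1, 0], [0, 1]])    # attained at s = t = (1, 1)
```

In the proof below this norm is never computed exactly: it is controlled with Bernstein's inequality and a union bound over the $4^n$ sign patterns.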

In order to apply Theorem 1, we just have to compute the fixed point $r^*(\Delta)$. As 
announced, we use the global approach and Grothendieck inequality (42) to get 

$$\sup_{Z \in \mathcal{C} : \langle \mathbb{E}B, Z^* - Z\rangle \le r} \langle B - \mathbb{E}B, Z - Z^*\rangle \le \sup_{Z \in \mathcal{C}} \langle B - \mathbb{E}B, Z - Z^*\rangle \le 2K_G \|B - \mathbb{E}B\|_{cut} \tag{43}$$

because $Z^* \in \mathcal{C}$. It follows from Bernstein’s inequality (41) and a union bound that for all 
$t > 0$, with probability at least $1 - 4^n\exp(-t)$, 

$$\|B - \mathbb{E}B\|_{cut} = \sup_{s,t \in \{\pm 1\}^n} \sum_{1 \le i < j \le n} (B_{ij} - \mathbb{E}B_{ij})(s_i t_j + s_j t_i) \le 2\sqrt{\frac{n(n-1)(1-p)t}{p}} + \frac{4t}{3}.$$

Therefore, for $t = 2n\log 4$, with probability at least $1 - 4^{-n}$, 

$$r^*(\Delta) \le \|B - \mathbb{E}B\|_{cut} \le 2n\sqrt{\frac{(2\log 4)(1-p)(n-1)}{p}} + \frac{8n\log 4}{3}$$

for $\Delta = 4^{-n}$. Then the result follows from Theorem 1. 


8. Numerical experiments 


This section contains the outcome of numerical experiments on the three application prob- 
lems considered: signed clustering, MAX-CUT, and angular synchronization. 


8.1 Signed Clustering 


To assess the effectiveness of the SDP relaxation, we consider the following experimental 
setup. We generate synthetic networks following the signed stochastic block model (SSBM) 
previously described in Section 4.1.1, with K = 5 communities. To quantify the effectiveness 
of the SDP relaxation, we compare the accuracy of a suite of algorithms from the signed 
clustering literature, before the SDP relaxation (i.e., when we perform these algorithms 
directly on A) and after the SDP relaxation (i.e., when we perform the very same algorithms 
on $\hat Z$). To measure the recovery quality of the clustering results, for a given indicator set 
$x_1,\ldots,x_K$, we rely on the error rate considered in Chiang et al. (2012), defined as 

$$\gamma := \sum_{c=1}^K \frac{x_c^\top A^-_{com}x_c + x_c^\top L^+_{com}x_c}{n^2}, \tag{44}$$

where $x_c$ denotes a cluster indicator vector, $A_{com}$ ($= \mathbb{E}A$) is the complete $K$-weakly balanced 
ground truth network (with 1’s on the diagonal blocks corresponding to intra-cluster edges, 
and $-1$ otherwise) with $A_{com} = A^+_{com} - A^-_{com}$, and $L^+_{com}$ denotes the combinatorial graph 
Laplacian corresponding to $A^+_{com}$. Note that $x_c^\top A^-_{com}x_c$ counts the number of violations 
within the clusters (since negative edges should not be placed within clusters) and $x_c^\top L^+_{com}x_c$ 
counts the number of violations across clusters (since positive edges should not belong to 
the cut). Overall, (44) essentially counts the fraction of intra-cluster and inter-cluster edge 
violations, with respect to the full ground truth matrix. Note that this definition can also 
be easily adjusted to work on real data sets, where the ground truth matrix $A_{com}$ is not 
available, which one can replace with the empirical observation $A$. 
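
A direct way to read (44) is to count, cluster by cluster, the negative entries inside the cluster and the positive entries leaving it. The sketch below assumes the error rate has the form $\gamma = \sum_c (x_c^\top A^-_{com}x_c + x_c^\top L^+_{com}x_c)/n^2$ and uses the identity $x_c^\top L^+ x_c = \sum_{i \in c, j \notin c} A^+_{ij}$ for indicator vectors; the function name and the toy matrix are hypothetical.

```python
def error_rate(a_com, labels, k):
    """gamma = sum_c (x_c^T A^- x_c + x_c^T L^+ x_c) / n^2 for cluster indicators x_c."""
    n = len(a_com)
    total = 0.0
    for c in range(k):
        inside = [lab == c for lab in labels]
        # x_c^T A^- x_c: weight of negative entries with both endpoints in cluster c.
        within = sum(-a_com[i][j] for i in range(n) for j in range(n)
                     if inside[i] and inside[j] and a_com[i][j] < 0)
        # x_c^T L^+ x_c: weight of positive entries going from cluster c to outside.
        across = sum(a_com[i][j] for i in range(n) for j in range(n)
                     if inside[i] and not inside[j] and a_com[i][j] > 0)
        total += within + across
    return total / n ** 2


# Toy ground truth with two clusters {0, 1} and {2, 3}.
a_com = [[1, 1, -1, -1],
         [1, 1, -1, -1],
         [-1, -1, 1, 1],
         [-1, -1, 1, 1]]
gamma_good = error_rate(a_com, [0, 0, 1, 1], k=2)  # correct clustering
gamma_bad = error_rate(a_com, [0, 1, 0, 1], k=2)   # clusters mixed up
```

The correct labelling produces no violation, while the mixed labelling is penalized both inside and across clusters.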

In terms of the signed clustering algorithms compared, we consider the following 
algorithms from the literature. One straightforward approach is to simply rely on the spectrum 
of the observed adjacency matrix $A$. Kunegis et al. (2010) proposed spectral tools for 
clustering, link prediction, and visualization of signed graphs, by solving a 2-way “signed” 
ratio-cut problem based on the combinatorial Signed Laplacian Hou (2005) $\bar L = \bar D - A$, 
where $\bar D$ is a diagonal matrix with $\bar D_{ii} = \sum_{j=1}^n |A_{ij}|$. The same authors proposed signed 
extensions for the case of the random-walk Laplacian $\bar L_{rw} = I - \bar D^{-1}A$, and the symmetric 
graph Laplacian $\bar L_{sym} = I - \bar D^{-1/2}A\bar D^{-1/2}$, the latter of which is particularly suitable for 
skewed degree distributions. Finally, the last algorithm we considered is BNC of Chiang 
et al. (2012), who introduced a formulation based on the Balanced Normalized Cut objective 

$$\min_{\{x_1,\ldots,x_K\} \in \mathcal{I}} \Big(\sum_{c=1}^K \frac{x_c^\top (D^+ - A)x_c}{x_c^\top \bar D x_c}\Big), \tag{45}$$

which, in light of the decomposition $D^+ - A = D^+ - (A^+ - A^-) = D^+ - A^+ + A^- = L^+ + A^-$, 
is effectively minimizing the number of violations in the clustering procedure. 

In our experiments, we first compute the error rate $\gamma_{before}$ of all algorithms on the 
original SSBM graph (shown in Column 1 of Figure 1), and then we repeat the procedure but 
with the input to all signed clustering algorithms being given by the output of the SDP 
relaxation, and denote the resulting recovery error by $\gamma_{after}$. The third column of the same 
Figure 1 shows the difference in errors $\gamma_\delta = \gamma_{before} - \gamma_{after}$ between the first and second 
columns, while the fourth column contains a histogram of the error differences $\gamma_\delta$. This 
altogether illustrates the fact that the SDP relaxation does improve the performance of all 
signed clustering algorithms, except $\bar L$, and could effectively be used as a denoising 
pre-processing step. One potential reason why the SDP pre-processing step does not improve 
on the accuracy of $\bar L$ could stem from the fact that $\bar L$ has a good performance to begin 
with on examples where the clusters have equal sizes and the degree distribution is 
homogeneous. It would be interesting to further compare the results in settings with skewed degree 
distributions, such as the classical Barabási-Albert model Albert and Barabási (2002). 


Figure 1: Summary of results for the Signed Clustering problem. The first column denotes 
the recovery error before the SDP relaxation step, meaning that we consider a number 
of signed clustering algorithms from the literature which we apply directly to the initial 
adjacency matrix $A$. The second column contains the results when applying the same 
suite of algorithms after the SDP relaxation. The third column shows the difference in 
errors between the first and second columns, while the fourth column contains a histogram 
of the delta errors. This altogether illustrates the fact that the SDP relaxation does improve 
the performance of all signed clustering algorithms except $\bar L$. Results are averaged over 20 
runs. 


8.2 MAX-CUT 


For the MAX-CUT problem, we consider two sets of numerical experiments. First, we 
consider a version of the stochastic block model which essentially perturbs a complete 
bipartite graph 

$$B = \begin{pmatrix} 0_{n_1\times n_1} & 1_{n_1\times n_2} \\ 1_{n_2\times n_1} & 0_{n_2\times n_2} \end{pmatrix}, \tag{46}$$

where $1_{n_1\times n_2}$ (respectively, $0_{n_1\times n_2}$) denotes an $n_1\times n_2$ matrix of all ones (respectively, all 
zeros). In our experiments, we set $n_1 = n_2 = n/2$, and fix $n = 500$. We perturb $B$ by deleting 
edges across the two partitions, and inserting edges within each partition. More specifically, 
we generated the full adjacency matrix $A^0$ from $B$ by adding edges independently with 
probability $\eta$ within each partition (i.e., along the diagonal blocks in (46)). Finally, we 
denote by $A$ the masked version we observe, $A = A^0 \circ S$, where $S$ denotes the adjacency 
matrix of an Erdős-Rényi$(n, \delta)$ graph. The graph shown in Figure 2 is an instance of the 
above generative model. Note that, for small values of $\eta$, we expect the maximum cut to 
occur across the initial partition $P_B$ in the clean bipartite graph $B$, which we aim to recover 
as we sparsify the observed graph $A$. The heatmap in the left of Figure 3 shows the Adjusted 
Rand Index (ARI) between the initial partition $P_B$ and the partition of the MAX-CUT 
SDP relaxation in (29), as we vary the noise parameter $\eta$ and the sparsity $\delta$. As expected, 
for a fixed level of noise $\eta$, we are able to recover the hypothetically optimal MAX-CUT, for 
suitable levels of the sparsity parameter. The heatmap in the right of Figure 3 shows the 
computational running time, as we vary the two parameters, showing that the MANOPT 
solver takes the longest to solve dense noisy problems, as one would expect. 

Figure 2: Illustration of MAX-CUT in the setting of a perturbation of a complete bipartite 
graph. 

In the second set of experiments shown in Figure 4, we consider a graph $A^0$ chosen at 
random from the collection³ of graphs known in the literature as the GSET, where we vary 
the sparsity level $\delta$, and show the MAX-CUT value attained on the original full graph $A^0$, 
but using the MAX-CUT partition computed by the SDP relaxation (29) on the sparsified 
graph $A$. 


Figure 3: Numerical results for MAX-CUT on a perturbed complete bipartite graph, as 
we vary the noise level $\eta$ and the sampling sparsity $\delta$. Results are averaged over 20 runs. 

Figure 4: Max-Cut results for the G53 benchmark graph (from the GSET collection) with 
n = 1000 nodes and average degree ~ 12. Results are averaged over 20 runs. 

3. http://web.stanford.edu/~yyye/yyye/Gset/ 


8.3 Angular Synchronization 


For the angular synchronization problem, we consider the following experimental setup, by 
assessing the quality of the recovered angular solution from the SDP relaxation, as we vary 
the two parameters of interest. In the x-axis in the plots from Figures 5 and 6 we vary the 
noise level $\sigma$, under two different noise models, Gaussian and outliers. On the y-axis, we 
vary the sparsity of the sampling graph. 

We measure the quality of the recovered angles via the Mean Squared Error (MSE), 
defined as follows. Since a solution can only be recovered up to a global shift, one needs 
an MSE error that mods out such a degree of freedom. The following MSE is also more 
broadly applicable for the case when the underlying group is the orthogonal group O(d), as 
opposed to just SO(2) as in the present work, where one can replace the unknown angles 
$\theta_1,\ldots,\theta_n$ with their respective representation as $2 \times 2$ rotation matrices $h_1,\ldots,h_n \in O(2)$. 
To that end, we look for an optimal orthogonal transformation $\hat O \in O(2)$ that minimizes 
the sum of squared distances between the estimated orthogonal transformations and the 
ground truth measurements 

$$\hat O = \mathrm{argmin}_{O \in O(2)} \sum_{i=1}^n \|h_i - O\hat h_i\|_F^2, \tag{47}$$

where $\hat h_1,\ldots,\hat h_n \in O(2)$ denote the $2 \times 2$ rotation matrix representation of the estimated 
angles $\hat\theta_1,\ldots,\hat\theta_n$. In other words, $\hat O$ is the optimal solution to the alignment problem 
between two sets of orthogonal transformations, in the least-squares sense. Following the 
analysis of Singer and Shkolnisky (2011), and making use of properties of the trace, one 
arrives at 

$$\sum_{i=1}^n \|h_i - O\hat h_i\|_F^2 = \sum_{i=1}^n \mathrm{Trace}\big[(h_i - O\hat h_i)(h_i - O\hat h_i)^\top\big] = \sum_{i=1}^n \mathrm{Trace}\big[2I - 2O\hat h_i h_i^\top\big] = 4n - 2\,\mathrm{Trace}\Big[O\sum_{i=1}^n \hat h_i h_i^\top\Big]. \tag{48}$$

If we let $Q$ denote the $2 \times 2$ matrix 

$$Q = \frac{1}{n}\sum_{i=1}^n \hat h_i h_i^\top, \tag{49}$$

it follows from (48) that the MSE is given by minimizing 

$$\frac{1}{n}\sum_{i=1}^n \|h_i - O\hat h_i\|_F^2 = 4 - 2\,\mathrm{Tr}(OQ). \tag{50}$$

In Arun et al. (1987) it is proven that $\mathrm{Tr}(OQ) \le \mathrm{Tr}(VU^\top Q)$, for all $O \in O(2)$, where 
$Q = U\Sigma V^\top$ is the singular value decomposition of $Q$. Therefore, the MSE is minimized by 
the orthogonal matrix $\hat O = VU^\top$ and is given by 

$$\mathrm{MSE} := \frac{1}{n}\sum_{i=1}^n \|h_i - \hat O\hat h_i\|_F^2 = 4 - 2\,\mathrm{Trace}(VU^\top U\Sigma V^\top) = 4 - 2(\sigma_1 + \sigma_2), \tag{51}$$

where $\sigma_1, \sigma_2$ are the singular values of $Q$. Therefore, whenever $Q$ is an orthogonal matrix 
for which $\sigma_1 = \sigma_2 = 1$, the MSE vanishes. Indeed, the numerical experiments (on a log 
scale) in Figures 5 and 6 confirm that for noiseless data, the MSE is very close to zero. 
Furthermore, as one would expect, under favorable noise regimes and sparsity levels, we 
have almost perfect recovery, both by the SDP and the spectral relaxations, under both 
noise models. 
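
The MSE in (51) only depends on the two singular values of the $2 \times 2$ matrix $Q$, and for a $2 \times 2$ matrix one has $\sigma_1 + \sigma_2 = \sqrt{\|Q\|_F^2 + 2|\det Q|}$, so no SVD routine is needed. The sketch below relies on this elementary identity (it is not taken from the paper); the function names and the test angles are illustrative.

```python
import math


def rotation(theta):
    """2 x 2 rotation matrix of angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def alignment_mse(true_angles, est_angles):
    """MSE = 4 - 2 (sigma_1 + sigma_2), with Q = (1/n) sum_i h_hat_i h_i^T."""
    n = len(true_angles)
    q = [[0.0, 0.0], [0.0, 0.0]]
    for theta, theta_hat in zip(true_angles, est_angles):
        h, h_hat = rotation(theta), rotation(theta_hat)
        for a in range(2):
            for b in range(2):
                # (h_hat h^T)_{ab} = sum_k h_hat[a][k] * h[b][k]
                q[a][b] += sum(h_hat[a][k] * h[b][k] for k in range(2)) / n
    frob2 = sum(v * v for row in q for v in row)
    det = q[0][0] * q[1][1] - q[0][1] * q[1][0]
    # (sigma_1 + sigma_2)^2 = sigma_1^2 + sigma_2^2 + 2 sigma_1 sigma_2
    sv_sum = math.sqrt(frob2 + 2 * abs(det))
    return 4 - 2 * sv_sum


truth = [0.1, 1.2, 2.5, 4.0]
shifted = [t + 0.7 for t in truth]           # exact recovery up to a global shift
mse_shift = alignment_mse(truth, shifted)    # vanishes up to rounding
mse_worst = alignment_mse([0.0, 0.0], [0.0, math.pi])  # Q = 0, so MSE = 4
```

A global shift of all angles leaves the MSE at zero, which is the invariance the definition is designed to have.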


(a) Spectral relaxation. (b) SDP relaxation (solved via MANOPT). 
Figure 5: Recovery rates (MSE (51) - the lower the better) for angular synchronization 
with n = 500, under the Gaussian noise model, as we vary the noise level $\sigma$ and the sparsity 
$p$ of the measurement graph. Results are averaged over 20 runs. 

Figure 6: Recovery rates (MSE (51) - the lower the better) for angular synchronization 
with n = 500, under the Outlier noise model, as we vary the noise level $\gamma$ and the sparsity 
$p$ of the measurement graph. Results are averaged over 20 runs. 


9. Conclusions and future work 


There are a number of other graph-based problems amenable to SDP relaxations, for which 
a similar theoretical analysis of their SDP-based estimators could be suitable. For example, 
the recent work of Cucuringu and Tyagi (2020) considered a problem motivated by 
geosciences and engineering applications of recovering a smooth unknown function $f : G \to \mathbb{R}$ 
(where $G = [a,b]$ is known) from noisy observations of its mod 1 values, which is also 
amenable to a solution based on an SDP relaxation solved via a Burer-Monteiro approach; 
the tightness of such an SDP relaxation was recently analyzed by Fanuel and Tyagi (2021). 
Another potential application concerns the problem of clustering directed graphs and 
uncovering unbalanced flows arising from the edge orientations, as in the very recent work 
of Cucuringu et al. (b) that proposed a spectral algorithm based on Hermitian matrices; 
this problem is also amenable to an SDP relaxation. Many other situations could also be 
considered under the angle of learning with a ‘linear loss function’ as presented in this 
work; just to name a few, we may think about the phase retrieval problem or the quadratic 
assignment problem from Yurtsever et al. (2021), a graph coloring problem as in Rendl 
(2010), a planted clique problem as in Hajek et al. (2016), the kernel clustering problem as 
in Giraud and Verzelen, a manifold learning problem as in Tepper et al. (2018), the angular 
bi-synchronization problem Cucuringu and Tyagi (2021), etc. 

Our theoretical and practical findings show that running algorithms (such as spectral
methods) directly on $A$ may be improved by first computing an SDP estimator, such as $\hat{Z}$,
and running the very same algorithms on $\hat{Z}$ (instead of $A$). In effect, $\hat{Z}$ performs a
pre-processing denoising step which improves the recovery of the hidden signal, such as
community vectors.


Acknowledgments 


Mihai Cucuringu acknowledges support from the EPSRC grant EP/N510129/1 at The Alan
Turing Institute. Guillaume Lecué is supported by a grant overseen by the French National
Research Agency (ANR) as part of the “Investissements d’Avenir” Program (LabEx ECODEC;
ANR-11-LABX-0047), by the Médiamétrie chair on “Statistical models and analysis of
high-dimensional data” and by the French ANR PRC grant ADDS (ANR-19-CE48-0005).


References 


Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic 
block model. IEEE Transactions on Information Theory, 62(1):471—487, 2015. 


Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering—a 
decade review. Information Systems, 53:16-38, 2015. 


Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks.
Reviews of modern physics, 74(1):47, 2002.


Brendan PW Ames. Guaranteed clustering and biclustering via semidefinite programming. 
Mathematical Programming, 147(1-2):429-465, 2014. 


Miguel F Anjos and Jean B Lasserre. Handbook on semidefinite, conic and polynomial 
optimization, volume 166. Springer Science & Business Media, 2011. 


K. Arun, T. Huang, and S. Blostein. Least-squares fitting of two 3-D point sets. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 9(5):698-700, 1987. 


Afonso S Bandeira and Ramon Van Handel. Sharp nonasymptotic bounds on the norm of 
random matrices with independent entries. The Annals of Probability, 44(4):2479-2506, 
2016. 


Afonso S Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment 
using semidefinite programming. In Proceedings of the 5th conference on Innovations in 
theoretical computer science, pages 459-470. ACM, 2014. 


A.S. Bandeira, N. Boumal, and A. Singer. Tightness of the maximum likelihood semidefinite
relaxation for angular synchronization. Mathematical Programming, 2016. 


Sujogya Banerjee, Kaushik Sarkar, Sedat Gokalp, Arunabha Sen, and Hasan Davulcu.
Partitioning signed bipartite graphs for classification of individuals and organizations. In
International Conference on Social Computing, Behavioral-Cultural Modeling, and
Prediction, pages 196-204. Springer, 2012.


Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probab. Theory Related 
Fields, 135(3):311-334, 2006. ISSN 0178-8051. 


A. I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. 
Discrete & Computational Geometry, 13(2):189-202, Mar 1995. 


Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. 
Probab. Theory Related Fields, 97(1-2):113-150, 1993. ISSN 0178-8051. doi:
10.1007/BF01199316. URL http://dx.doi.org/10.1007/BF01199316.


Grigoriy Blekherman, Pablo A Parrilo, and Rekha R Thomas. Semidefinite optimization 
and convex algebraic geometry. SIAM, 2012. 


Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast 
unfolding of communities in large networks. Journal of statistical mechanics: theory and 
experiment, 2008(10):P10008, 2008. 


Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A 
Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19- 
953525-5. 


N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for 
optimization on manifolds. Journal of Machine Learning Research, 15:1455-1459, 2014. 
URL http://www.manopt.org.


N. Boumal, V. Voroninski, and A.S. Bandeira. The non-convex Burer-Monteiro approach 
works on smooth semidefinite programs. In Advances in Neural Information Processing 
Systems 29, pages 2757-2765. 2016. 


Nicolas Boumal. A Riemannian low-rank method for optimization over semidefinite matrices 
with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015. 


Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 
2004. 


Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear 
matrix inequalities in system and control theory, volume 15. Siam, 1994. 


S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite 
programming. Mathematical Programming, 103(3):427—444, 2005. 


Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix
completion. IEEE Trans. Inform. Theory, 56(5):2053-2080, 2010. ISSN 0018-9448. doi:
10.1109/TIT.2010.2044061. URL http://dx.doi.org/10.1109/TIT.2010.2044061.




Djalil Chafai, Olivier Guédon, Guillaume Lecué, and Alain Pajor. Interactions between com- 
pressed sensing random matrices and high dimensional geometry, volume 37 of Panoramas 
et Synthèses [Panoramas and Syntheses]. Société Mathématique de France, Paris, 2012. 
ISBN 978-2-85629-370-6. 


Kai-Yang Chiang, Joyce Whang, and Inderjit S. Dhillon. Scalable Clustering of Signed Net- 
works using Balance Normalized Cut. In ACM Conference on Information and Knowledge 
Management (CIKM), oct 2012. 


Geoffrey Chinot, Guillaume Lecué, and Matthieu Lerasle. Statistical learning with Lipschitz
and convex loss functions. arXiv preprint arXiv:1810.01090, 2018. 


Stéphane Chrétien and Franck Corset. Using the eigenvalue relaxation for binary least- 
squares estimation problems. Signal Processing, 89(11):2079-2091, 2009. 


Stéphane Chrétien, Clément Dombry, and Adrien Faivre. A semi-definite programming 
approach to low dimensional embedding for unsupervised clustering. Frontiers in Applied 
Mathematics and Statistics, to appear. 


Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in 
very large networks. Physical review E, 70(6):066111, 2004. 


M. Cucuringu. Synchronization over $\mathbb{Z}_2$ and community detection in multiplex networks
with constraints. Journal of Complex Networks, 3:469-506, 2015. 


M. Cucuringu. Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation 
via Eigenvector and Semidefinite Programming Synchronization. IEEE Transactions on 
Network Science and Engineering, 3(1):58-79, 2016. 


M. Cucuringu, P. Davies, A. Glielmo, and H. Tyagi. SPONGE: A generalized eigenproblem 
for clustering signed networks. AISTATS 2019, a. 


M. Cucuringu, H. Li, H. Sun, and L. Zanetti. Hermitian matrices for clustering directed 
graphs: insights and applications. AISTATS 2020, b. 





Mihai Cucuringu and Hemant Tyagi. Provably robust estimation of modulo 1 samples of 
a smooth function with applications to phase unwrapping. Journal of Machine Learning 
Research, 21(32):1-77, 2020. URL http://jmlr.org/papers/v21/18-143.html.


Mihai Cucuringu and Hemant Tyagi. An extension of the angular synchronization problem 
to the heterogeneous setting. arXiv:2012.14932, 2021. 


Mihai Cucuringu, Apoorv Vikram Singh, Déborah Sulem, and Hemant Tyagi. Regularized 
spectral methods for clustering signed networks. Journal of Machine Learning Research 
(JMLR) (to appear), 2021. 


Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from 
incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4): 
608-622, 2016. 




Kenneth R Davidson and Stanislaw J Szarek. Local operator theory, random matrices and 
banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001. 


Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. 
SIAM Journal on Numerical Analysis, 7(1):1—46, 1970. 


J. A. Davis. Clustering and structural balance in graphs. Human Relations, 20(2):181—187, 
1967. 


Yohann de Castro, Fabrice Gamboa, Didier Henrion, and Jean B. Lasserre. Exact so- 
lutions to super resolution on semi-algebraic domains in higher dimensions. IEEE 
Trans. Information Theory, 63(1):621-630, 2017. doi: 10.1109/TIT.2016.2619368. URL
https://doi.org/10.1109/TIT.2016.2619368.


Yohann De Castro, Fabrice Gamboa, Didier Henrion, Roxana Hess, Jean-Bernard Lasserre, 
et al. Approximate optimal designs for multivariate polynomial regression. The Annals 
of Statistics, 47(1):127-155, 2019. 


Laurent Demanet and Paul Hand. Stable optimizationless recovery from phaseless linear 
measurements. Journal of Fourier Analysis and Applications, 20(1):199-221, 2014. 


M. Fanuel and H. Tyagi. Denoising modulo samples: k-nn regression and tightness of sdp 
relaxation. Information and Inference (to appear), 2021. 


M. Fanuel, A. Aspeel, J-C. Delvenne, and J.A.K. Suykens. Positive semi-definite embedding 
for dimensionality reduction and out-of-sample extension. 2017. 


Yingjie Fei and Yudong Chen. Achieving the bayes error rate in stochastic block model 
by sdp, robustly. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learn- 
ing Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Pro- 
ceedings of Machine Learning Research, pages 1235-1269. PMLR, 2019a. URL
http://proceedings.mlr.press/v99/fei19a.html.


Yingjie Fei and Yudong Chen. Exponential error rates of SDP for block models: beyond 
Grothendieck’s inequality. IEEE Trans. Inform. Theory, 65(1):551-571, 2019b. ISSN 
0018-9448. 


Yingjie Fei and Yudong Chen. Achieving the Bayes error rate in synchronization and 
block models by SDP, robustly. IEEE Trans. Inform. Theory, 66(6):3929-3953, 2020. 
ISSN 0018-9448. doi: 10.1109/TIT.2020.2966438. URL
https://doi.org/10.1109/TIT.2020.2966438.


Roger Fletcher. A nonlinear programming problem in statistics (educational testing). SIAM
Journal on Scientific and Statistical Computing, 2(3):257-267, 1981.


Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75-174, 2010. 


Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. 
Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, New York, 2013. ISBN
978-0-8176-4947-0; 978-0-8176-4948-7. doi: 10.1007/978-0-8176-4948-7. URL
http://dx.doi.org/10.1007/978-0-8176-4948-7.




Bernd Gärtner and Jiří Matoušek. Semidefinite programming. In Approximation Algorithms
and Semidefinite Programming, pages 15-25. Springer, 2012. 


Jean Charles Gilbert and Cédric Josz. Plea for a semidefinite optimization solver in complex 
numbers. 2017. 


Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the 
relaxed K-means. Mathematical Statistics and Learning, 1(3):317-374. 


Michel X Goemans. Semidefinite programming in combinatorial optimization. Mathematical 
Programming, 79(1-3):143-161, 1997. 


Michel X Goemans and David P Williamson. New 3/4-approximation algorithms for the
maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7(4):656-666, 
1994. 


Michel X Goemans and David P Williamson. Improved approximation algorithms for max- 
imum cut and satisfiability problems using semidefinite programming. Journal of the 
ACM (JACM), 42(6):1115-1145, 1995. 


David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. 
Inform. Theory, 57(3):1548-1566, 2011. ISSN 0018-9448. doi: 10.1109/TIT.2011.2104999. 
URL http://dx.doi.org/10.1109/TIT.2011.2104999.


David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum 
state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010. 


A. Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Bol. 
Soc. Mat. São Paulo, 8:1-79, 1953. ISSN 0373-1375. 


Alexandre Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. 
Soc. de Matematica de Sao Paulo, 1956. 


Olivier Guédon and Roman Vershynin. Community detection in sparse networks via 
Grothendieck’s inequality. Probab. Theory Related Fields, 165(3-4):1025-1049, 2016. ISSN 
0178-8051. URL https://doi.org/10.1007/s00440-015-0659-z. 


Olivier Guédon and Roman Vershynin. Community detection in sparse networks via 
grothendieck’s inequality. Probability Theory and Related Fields, 165(3-4):1025-1049, 
2016. 


Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via 
semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788-2797,
2016. 


Simai He, Zhi-Quan Luo, Jiawang Nie, and Shuzhong Zhang. Semidefinite relaxation bounds 
for indefinite homogeneous quadratic optimization. SIAM Journal on Optimization, 19 
(2):503-523, 2008. 




Chinmay Hegde, Aswin C Sankaranarayanan, and Richard G Baraniuk. Near-isometric 
linear embeddings of manifolds. In 2012 IEEE Statistical Signal Processing Workshop 
(SSP), pages 728-731. IEEE, 2012. 


Samuel B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, 
abs/1809.07425, 2018. URL http://arxiv.org/abs/1809.07425. 


Jao Ping Hou. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathe- 
matica Sinica, 21(4):955—960, 2005. 


Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi. Phase transitions in 
semidefinite relaxations. Proceedings of the National Academy of Sciences, 113(16): 
E2218-E2223, 2016.


David Karger, Rajeev Motwani, and Madhu Sudan. Approximate graph coloring by semidef- 
inite programming. Journal of the ACM (JACM), 45(2):246-265, 1998. 


Subhash Khot and Assaf Naor. Approximate kernel clustering. Mathematika, 55(1-2): 
129-165, 2009. 


Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk 
minimization. Ann. Statist., 34(6):2593-2656, 2006. ISSN 0090-5364. doi: 10.1214/ 
009053606000001019. URL http://dx.doi.org/10.1214/009053606000001019.


Vladimir Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery 
problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. 
ISBN 978-3-642-22146-0. doi: 10.1007/978-3-642-22147-7. URL http://dx.doi.org/ 
10.1007/978-3-642-22147-7. Lectures from the 38th Probability Summer School held 
in Saint-Flour, 2008, Ecole d’Eté de Probabilités de Saint-Flour. [Saint-Flour Probability 
Summer School]. 


Jérôme Kunegis, Stephan Schmidt, Andreas Lommatzsch, Jürgen Lerner, Ernesto William 
De Luca, and Sahin Albayrak. Spectral analysis of signed graphs for clustering, prediction 
and visualization. SDM, 10:559-570, 2010. 


Jean Bernard Lasserre. An Introduction to Polynomial and Semi-Algebraic Optimization. 
Number 52. Cambridge University Press, 2015. 


Javad Lavaei and Steven H Low. Zero duality gap in optimal power flow problem. IEEE 
Transactions on Power Systems, 27(1):92-107, 2011. 


Can M. Le, Elizaveta Levina, and Roman Vershynin. Optimization via low-rank approxi- 
mation for community detection in networks. Ann. Statist., 44(1):373-400, 2016. ISSN 
0090-5364. URL https://doi.org/10.1214/15-AOS1360.


Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax 
bounds. Technical report, CNRS, Ecole polytechnique and Technion, 2013. 


Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical 
Surveys and Monographs. American Mathematical Society, Providence, RI, 2001. ISBN 
0-8218-2864-9. 




Michel Ledoux and Michel Talagrand. Probability in Banach spaces, volume 23 of Ergebnisse 
der Mathematik und ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas 
(3)]. Springer-Verlag, Berlin, 1991. ISBN 3-540-52013-9. Isoperimetry and processes. 


Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle meth- 
ods. Mathematical programming, 69(1-3):111-147, 1995. 


J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in 
online social networks. In WWW, pages 641-650, 2010a. 


Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed Networks in Social Media. 
In CHI, pages 1361-1370, 2010b. 


László Lovász. On the shannon capacity of a graph. IEEE Transactions on Information 
theory, 25(1):1-7, 1979. 


Wing-Kin Ken Ma. Semidefinite relaxation of quadratic optimization problems and appli- 
cations. IEEE Signal Processing Magazine, 1053(5888/10), 2010. 


Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Ann. Statist., 
27(6):1808-1829, 1999. ISSN 0090-5364. doi: 10.1214/aos/1017939240. URL
https://doi.org/10.1214/aos/1017939240.


Bernard Martinet. Algorithmes pour la résolution de problèmes d’optimisation et de
minimax. PhD thesis, Université Joseph-Fourier-Grenoble I, 1972.


Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture 
Notes in Mathematics. Springer, Berlin, 2007. ISBN 978-3-540-48497-4; 3-540-48497-3. 
Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 
6-23, 2003, With a foreword by Jean Picard. 


David A Mazziotti. Large-scale semidefinite programming for many-electron quantum me- 
chanics. Physical review letters, 106(8):083001, 2011. 


Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the 
planted partition model. Probab. Theory Related Fields, 162(3-4):431-461, 2015. 
ISSN 0178-8051. doi: 10.1007/s00440-014-0576-6. URL
https://doi.org/10.1007/s00440-014-0576-6.


Y Nesterov. Semidefinite relaxation and non-convex quadratic optimization. Optimization 
Methods and Software, 12:1-20, 1997. 


Mark EJ Newman. Finding community structure in networks using the eigenvectors of 
matrices. Physical review E, 74(3):036104, 2006. 


Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Solving large scale binary quadratic 
problems: Spectral methods vs. semidefinite programming. In 2007 IEEE Conference on 
Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2007. 


Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite pro- 
gramming. SIAM journal on optimization, 18(1):186—205, 2007. 




Jiming Peng and Yu Xia. A new theoretical framework for k-means-type clustering. In 
Foundations and advances in data mining, pages 79-96. Springer, 2005. 


Guy Pierra. Decomposition through formalization in a product space. Mathematical Pro- 
gramming, 28(1):96-115, 1984. 


Gilles Pisier. Grothendieck’s theorem, past and present. Bulletin of the American Mathe- 
matical Society, 49(2):237-323, 2012. 


Mason A Porter, Jukka-Pekka Onnela, and Peter J Mucha. Communities in networks. 
Notices of the AMS, 56(9):1082-1097, 2009. 


Franz Rendl. Semidefinite relaxations for integer programming. In Michael Jünger, 
Thomas M. Liebling, Denis Naddef, George L. Nemhauser, William R. Pulleyblank, Ger- 
hard Reinelt, Giovanni Rinaldi, and Laurence A. Wolsey, editors, 50 Years of Integer 
Programming 1958-2008 - From the Early Years to the State-of-the-Art, pages 687-726. 
Springer, 2010. doi: 10.1007/978-3-540-68279-0_18. URL
https://doi.org/10.1007/978-3-540-68279-0_18.


Martin Royer. Adaptive clustering through semidefinite programming. In Advances in 
Neural Information Processing Systems, pages 1795-1803, 2017. 


P Scobey and DG Kabe. Vector quadratic programming problems and inequality con- 
strained least squares estimation. J. Indust. Math. Soc., 28:37—49, 1978. 


Alexander Shapiro. Weighted minimum trace factor analysis. Psychometrika, 47(3):243- 
264, 1982. 


A. Singer and Y. Shkolnisky. Three-dimensional structure determination from common lines 
in Cryo-EM by eigenvectors and semidefinite programming. SIAM Journal on Imaging 
Sciences, 4(2):543-572, 2011. 


Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Ap- 
plied and computational harmonic analysis, 30(1):20-36, 2011. 


Jun Sun, Stephen Boyd, Lin Xiao, and Persi Diaconis. The fastest mixing markov process 
on a graph and a connection to a maximum variance unfolding problem. SIAM review, 
48(4):681-699, 2006. 


Mariano Tepper, Anirvan M. Sengupta, and Dmitri B. Chklovskii. Clustering is semidefi- 
nitely not that hard: Nonnegative SDP for manifold disentangling. J. Mach. Learn. Res., 
19:82:1-82:30, 2018. URL http://jmlr.org/papers/v19/18-088.html.


Michael J Todd. Semidefinite optimization. Acta Numerica, 10:515-560, 2001. 


Alexandre B. Tsybakov. Optimal rate of aggregation. In Computational Learning Theory 
and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes in Artificial Intelli- 
gence, pages 303-313. Springer, Heidelberg, 2003. 




Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. 
Statist., 32(1):135-166, 2004. ISSN 0090-5364. doi: 10.1214/aos/1079120131. URL 
http://dx.doi.org/10.1214/aos/1079120131.


Sara A. van de Geer. Applications of empirical process theory, volume 6 of Cambridge Series 
in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 
2000. ISBN 0-521-65002-X. 


Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. 
Springer Series in Statistics. Springer-Verlag, New York, 1996. ISBN 0-387-94640-3. With 
applications to statistics. 


Vladimir N. Vapnik. Statistical learning theory. Adaptive and Learning Systems for Signal 
Processing, Communications, and Control. John Wiley & Sons, Inc., New York, 1998. 
ISBN 0-471-03003-1. A Wiley-Interscience Publication. 


Roman Vershynin. High-dimensional probability: An introduction with applications in data 
science, volume 47. Cambridge University Press, 2018. 


V Vu. Singular vectors under random perturbation. Preprint available in arXiv:1004.2000,
2010. 


Irène Waldspurger, Alexandre d’Aspremont, and Stéphane Mallat. Phase recovery, max- 
cut and complex semidefinite programming. Mathematical Programming, 149(1-2):47-81, 
2015. 


Kilian Q Weinberger and Lawrence K Saul. An introduction to nonlinear dimensionality 
reduction by maximum variance unfolding. In AAAI, volume 6, pages 1683-1686, 2006a. 


Kilian Q Weinberger and Lawrence K Saul. Unsupervised learning of image manifolds by 
semidefinite programming. International journal of computer vision, 70(1):77-90, 2006b.


Kilian Q Weinberger, Benjamin Packer, and Lawrence K Saul. Nonlinear dimensionality 
reduction by semidefinite programming and kernel matrix factorization. In AISTATS, 
volume 2, page 6. Citeseer, 2005. 


Henry Wolkowicz. Semidefinite and Lagrangian relaxations for hard combinatorial problems. 
In IFIP Conference on System Modeling and Optimization, pages 269-309. Springer, 1999. 


Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of semidefinite pro- 
gramming: theory, algorithms, and applications, volume 27. Springer Science & Business 
Media, 2012. 


M. Xu, V. Jog, H. Sun, and PL. Loh. Optimal rates for community estimation in the 
weighted stochastic block model. Annals of statistics, 2020. 


Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem
for statisticians. Biometrika, 102(2):315-323, 2015. doi: 10.1093/biomet/asv008. URL
http://dx.doi.org/10.1093/biomet/asv008.




Alp Yurtsever, Joel A. Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable 
semidefinite programming. SIAM J. Math. Data Sci., 3(1):171-200, 2021. doi: 10.1137/ 
19M1305045. URL https://doi.org/10.1137/19M1305045.


S. Zhang and Y. Huang. Complex quadratic optimization and semidefinite programming. 
SIAM Journal on Optimization, 16(3):871-890, 2006. 


Shuzhong Zhang. Quadratic maximization and semidefinite relaxation. Mathematical Pro- 
gramming, 87(3):453-465, 2000. 




Appendix A. Proofs of Theorem 1 and Theorem 2 
Proof of Theorem 1 


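Denote $r^* = r^*(\Delta)$. Assume first that $r^* > 0$ (the case $r^* = 0$ is analyzed later). Let
$\Omega^*$ be the event on which, for all $Z \in \mathcal{C}$, if $\langle \mathbb{E}A, Z^* - Z \rangle \le r^*$ then
$\langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) r^*$. By definition of $r^*$, we have $\mathbb{P}[\Omega^*] \ge 1 - \Delta$.

Let $Z \in \mathcal{C}$ be such that $\langle \mathbb{E}A, Z^* - Z \rangle > r^*$ and define $Z'$ such that
$Z' - Z^* = (r^* / \langle \mathbb{E}A, Z^* - Z \rangle)(Z - Z^*)$. We have $\langle \mathbb{E}A, Z^* - Z' \rangle = r^*$ and
$Z' \in \mathcal{C}$ because $\mathcal{C}$ is star-shaped in $Z^*$. Therefore, on the event $\Omega^*$,
$\langle A - \mathbb{E}A, Z' - Z^* \rangle \le (1/2) r^*$ and so, multiplying both sides by
$\langle \mathbb{E}A, Z^* - Z \rangle / r^*$, $\langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) \langle \mathbb{E}A, Z^* - Z \rangle$. It therefore
follows that on the event $\Omega^*$, if $Z \in \mathcal{C}$ is such that $\langle \mathbb{E}A, Z^* - Z \rangle > r^*$ then

$$\langle A, Z - Z^* \rangle \le (-1/2) \langle \mathbb{E}A, Z^* - Z \rangle < -r^*/2,$$

which implies that $\langle A, Z - Z^* \rangle < 0$ and therefore $Z$ does not maximize $Z \mapsto \langle A, Z \rangle$ over
$\mathcal{C}$. As a consequence, we necessarily have $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le r^*$ on the event $\Omega^*$ (which
holds with probability at least $1 - \Delta$).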

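Let us now assume that $r^* = 0$. There exists a decreasing sequence $(r_n)_n$ of positive
real numbers tending to $r^* = 0$ such that for all $n \ge 0$, $\mathbb{P}[\Omega_n] \ge 1 - \Delta$ where $\Omega_n$ is the
event $\Omega_n = \{\psi(r_n) \le 1/2\}$ where for all $r > 0$,

$$\psi(r) = \frac{1}{r} \sup_{Z \in \mathcal{C} : \langle \mathbb{E}A, Z^* - Z \rangle \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle.$$

Since $\mathcal{C}$ is star-shaped in $Z^*$, $\psi$ is a non-increasing function and so $(\Omega_n)_n$ is a decreasing
sequence (i.e. $\Omega_{n+1} \subset \Omega_n$ for all $n \ge 0$). It follows that
$\mathbb{P}[\cap_n \Omega_n] = \lim_n \mathbb{P}[\Omega_n] \ge 1 - \Delta$.
Let us now place ourselves on the event $\cap_n \Omega_n$. For all $n$, since $\Omega_n$ holds and $r_n > 0$, we
can use the same argument as in the first case to conclude that $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le r_n$. Since the
latter inequality is true for all $n$ (on the event $\cap_n \Omega_n$) and $(r_n)_n$ tends to zero, we conclude
that $\langle \mathbb{E}A, Z^* - \hat{Z} \rangle \le 0 = r^*$. $\blacksquare$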
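
The rescaling step used in the first case of this proof can be illustrated with a few lines of code. This is only a sketch under stated assumptions: matrices are plain NumPy arrays, the inner product is the entrywise (Frobenius) one, and the helper name `rescale_towards` is hypothetical.

```python
import numpy as np


def rescale_towards(Z_star, Z, EA, r):
    """Return (Z', t) with Z' = Z* + t (Z - Z*) and t = r / <EA, Z* - Z>.

    Requires <EA, Z* - Z> > r > 0, so that t lies in (0, 1) and Z' stays on
    the segment between Z* and Z (hence in any set that is star-shaped in Z*).
    """
    gap = np.sum(EA * (Z_star - Z))      # <EA, Z* - Z>
    if not gap > r > 0:
        raise ValueError("need <EA, Z* - Z> > r > 0")
    t = r / gap
    return Z_star + t * (Z - Z_star), t
```

By linearity, $\langle \mathbb{E}A, Z^* - Z' \rangle = r$ and any bound on $\langle A - \mathbb{E}A, Z' - Z^* \rangle$ transfers to $Z$ after division by $t$, which is the mechanism the proof relies on.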


























Proof of Theorem 2 


Let $r^* = r^*_G(\Delta)$. First assume that $r^* > 0$. Let $Z \in \mathcal{C}$ be such that $G(Z^* - Z) > r^*$. Let
$f : \lambda \in [0,1] \mapsto G(\lambda(Z^* - Z))$. We have $f(0) = G(0) = 0$, $f(1) = G(Z^* - Z) > r^*$ and $f$
is continuous. Therefore, there exists $\lambda_0 \in (0,1)$ such that $f(\lambda_0) = r^*$. We let $Z'$ be such
that $Z' - Z^* = \lambda_0 (Z - Z^*)$. Since $\mathcal{C}$ is star-shaped in $Z^*$ and $\lambda_0 \in [0,1]$ we have $Z' \in \mathcal{C}$.
Moreover, $G(Z^* - Z') = r^*$. As a consequence, on the event $\Omega^*$ such that for all $Z \in \mathcal{C}$, if
$G(Z^* - Z) \le r^*$ then $\langle A - \mathbb{E}A, Z - Z^* \rangle \le (1/2) r^*$, we have
$\langle A - \mathbb{E}A, Z' - Z^* \rangle \le (1/2) r^*$. The latter and Assumption 2 imply that, on $\Omega^*$,

$$(1/2) r^* \ge \langle A, Z' - Z^* \rangle + \langle \mathbb{E}A, Z^* - Z' \rangle \ge \langle A, Z' - Z^* \rangle + G(Z^* - Z') \ge \langle A, Z' - Z^* \rangle + r^*$$

and so $\langle A, Z' - Z^* \rangle \le -r^*/2$. Finally, using the definition of $Z'$, we obtain

$$\langle A, Z - Z^* \rangle = (1/\lambda_0) \langle A, Z' - Z^* \rangle \le -r^*/(2\lambda_0) < 0.$$

In particular, $Z$ cannot be a maximizer of $Z \mapsto \langle A, Z \rangle$ over $\mathcal{C}$ and so necessarily, on the
event $\Omega^*$, $G(Z^* - \hat{Z}) \le r^*$.

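Let us now consider the case where $r^* = 0$. Using the same approach as in the proof of
Theorem 1, we only have to check that the function

$$\psi : r > 0 \mapsto \frac{1}{r} \sup_{Z \in \mathcal{C} : G(Z^* - Z) \le r} \langle A - \mathbb{E}A, Z - Z^* \rangle$$

is non-increasing. Let $0 < r_1 < r_2$. W.l.o.g. we may assume that there exists some $Z_2 \in \mathcal{C}$
such that $G(Z^* - Z_2) \le r_2$ and $\psi(r_2) = \langle A - \mathbb{E}A, Z_2 - Z^* \rangle / r_2$. If $G(Z^* - Z_2) \le r_1$
then $\psi(r_2) \le (r_1/r_2) \psi(r_1) \le \psi(r_1)$. If $G(Z^* - Z_2) > r_1$ then there exists $\lambda_0 \in (0,1)$
such that for $Z_1 = Z^* + \lambda_0 (Z_2 - Z^*)$ we have $G(Z^* - Z_1) = r_1$ and $Z_1 \in \mathcal{C}$. Moreover,
$r_1 = G(\lambda_0 (Z^* - Z_2)) \le \lambda_0 G(Z^* - Z_2) \le \lambda_0 r_2$ and so $\lambda_0 \ge r_1/r_2$. It follows that

$$\psi(r_2) = \frac{1}{r_2} \langle A - \mathbb{E}A, Z_2 - Z^* \rangle = \frac{1}{\lambda_0 r_2} \langle A - \mathbb{E}A, Z_1 - Z^* \rangle \le \frac{r_1}{\lambda_0 r_2} \psi(r_1) \le \psi(r_1)$$

where we used that $\psi(r) \ge 0$ for all $r > 0$ because $Z^* \in \{Z \in \mathcal{C} : G(Z^* - Z) \le r\}$ for all
$r > 0$. $\blacksquare$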


Appendix B. Additional proofs for Signed Clustering 
Proof of Equation (10) 


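We recall that the cluster matrix $\bar{Z} \in \{0,1\}^{n \times n}$ is defined by $\bar{Z}_{ij} = 1$ if $i \sim j$ and
$\bar{Z}_{ij} = 0$ when $i \not\sim j$, and that $\alpha = \delta(p + q - 1)$. For every matrix $Z \in [0,1]^{n \times n}$, we have

$$\begin{aligned}
\langle Z, \mathbb{E}A - \alpha J \rangle &= \sum_{i,j=1}^{n} Z_{ij} (\mathbb{E}A_{ij} - \alpha) = \sum_{(i,j) \in \mathcal{C}^+} Z_{ij} (\mathbb{E}A_{ij} - \alpha) + \sum_{(i,j) \in \mathcal{C}^-} Z_{ij} (\mathbb{E}A_{ij} - \alpha) \\
&= [\delta(2p - 1) - \alpha] \sum_{(i,j) \in \mathcal{C}^+ : i \ne j} Z_{ij} + [\delta(2q - 1) - \alpha] \sum_{(i,j) \in \mathcal{C}^-} Z_{ij} + (1 - \alpha) \sum_{i=1}^{n} Z_{ii}.
\end{aligned}$$

The latter quantity is maximal for $Z \in [0,1]^{n \times n}$ such that $Z_{ij} = 1$ for $(i,j) \in \mathcal{C}^+$ and
$Z_{ij} = 0$ for $(i,j) \in \mathcal{C}^-$, that is when $Z = \bar{Z}$. As a consequence
$\{\bar{Z}\} = \mathrm{argmax}_{Z \in [0,1]^{n \times n}} \langle Z, \mathbb{E}A - \alpha J \rangle$.

Moreover, $\bar{Z} \in \mathcal{C} \subset [0,1]^{n \times n}$ so we also have that $\bar{Z}$ is the only solution to the problem
$\max_{Z \in \mathcal{C}} \langle Z, \mathbb{E}A - \alpha J \rangle$ and so $\bar{Z} = Z^*$. $\blacksquare$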
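
The sign argument above can be checked on a toy instance. The sketch below is an illustration under assumptions: the expected adjacency matrix has entries $\delta(2p-1)$ within clusters, $\delta(2q-1)$ across clusters and $1$ on the diagonal (as read off from the coefficients in the display), with $p > q$; the helper names and the parameter values in the usage note are hypothetical.

```python
import numpy as np


def expected_signed_adjacency(labels, p, q, delta):
    """Return (EA, Z_bar) for the assumed model.

    EA has entries delta*(2p-1) within clusters, delta*(2q-1) across clusters
    and 1 on the diagonal; Z_bar is the 0/1 cluster membership matrix.
    """
    same = labels[:, None] == labels[None, :]
    EA = np.where(same, delta * (2 * p - 1), delta * (2 * q - 1))
    np.fill_diagonal(EA, 1.0)
    return EA, same.astype(float)


def box_maximizer(M):
    """Unique maximizer of Z -> <Z, M> over [0,1]^{n x n} when M has no zero entry."""
    return (M > 0).astype(float)
```

For instance, with `labels = np.array([0, 0, 1, 1, 2])`, `p = 0.8`, `q = 0.3` and `delta = 0.5`, the matrix `EA - delta * (p + q - 1)` is positive exactly on same-cluster pairs, so `box_maximizer` returns the cluster matrix.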
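Proof of Proposition 2: control of $S_1(Z)$ adapted from Fei and Chen (2019b)


The noise matrix $W$ is symmetric and has been decomposed as $W = \Psi + \Psi^\top$ where $\Psi$ has
been defined in (62). For all $Z \in \mathcal{C} \cap (Z^* + r B_1^{n \times n})$, we have

$$\begin{aligned}
S_1(Z) &= \langle \mathcal{P}(Z - Z^*), W \rangle = \langle \mathcal{P}(W), Z - Z^* \rangle \\
&= \langle UU^\top W, Z - Z^* \rangle + \langle W UU^\top, Z - Z^* \rangle - \langle UU^\top W UU^\top, Z - Z^* \rangle \\
&= 2 \langle UU^\top W, Z - Z^* \rangle - \langle UU^\top W UU^\top, Z - Z^* \rangle \\
&= 2 \langle UU^\top \Psi, Z - Z^* \rangle + 2 \langle UU^\top \Psi^\top, Z - Z^* \rangle - \langle UU^\top (\Psi + \Psi^\top) UU^\top, Z - Z^* \rangle \\
&= 2 \langle UU^\top \Psi, Z - Z^* \rangle + 2 \langle UU^\top \Psi^\top, Z - Z^* \rangle - 2 \langle UU^\top \Psi UU^\top, Z - Z^* \rangle \\
&= 2 \langle UU^\top \Psi, Z - Z^* \rangle + 2 \langle UU^\top \Psi^\top, Z - Z^* \rangle - 2 \langle UU^\top \Psi, (Z - Z^*) UU^\top \rangle.
\end{aligned} \tag{52}$$

An upper bound on $S_1(Z)$ follows from an upper bound on the three terms in the right side
of (52). Let us show how to bound the first term. Similar arguments can be used to control
the other two terms.

Let $V := UU^\top \Psi$. Let us find a high-probability upper bound on the term
$\langle UU^\top \Psi, Z - Z^* \rangle = \langle V, Z - Z^* \rangle$ uniformly over $Z \in \mathcal{C} \cap (Z^* + r B_1^{n \times n})$. For all
$k \in [K]$, $i \in \mathcal{C}_k$ and $j \in [n]$, we have

$$V_{ij} = \frac{1}{l_k} \sum_{t \in \mathcal{C}_k} \Psi_{tj}.$$

Therefore, given $j \in [n]$ the $V_{ij}$'s are all equal for $i \in \mathcal{C}_k$. We can therefore fix some arbitrary
index $i_k \in \mathcal{C}_k$ and have $V_{ij} = V_{i_k j}$ for all $i \in \mathcal{C}_k$. Moreover, $(V_{i_k j} : k \in [K], j \in [n])$ is a
family of independent random variables. We now have

$$\langle V, Z - Z^* \rangle = \sum_{k \in [K]} \sum_{i \in \mathcal{C}_k} \sum_{j \in [n]} V_{ij} (Z - Z^*)_{ij} = \sum_{k \in [K]} \sum_{j \in [n]} V_{i_k j} \sum_{i \in \mathcal{C}_k} (Z - Z^*)_{ij} = \sum_{k \in [K]} \sum_{j \in [n]} l_k V_{i_k j} w_{kj}$$

which is a weighted sum of $nK$ independent centered random variables $X_{kj} := l_k V_{i_k j}$ with
weights $w_{kj} = (1/l_k) \sum_{i \in \mathcal{C}_k} (Z - Z^*)_{ij}$ for $k \in [K]$, $j \in [n]$. We now identify some properties
of the weights $w_{kj}$.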
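The weights are such that

$$\sum_{k \in [K]} \sum_{j \in [n]} |w_{kj}| \le \sum_{k \in [K]} \sum_{j \in [n]} \frac{1}{l_k} \sum_{i \in \mathcal{C}_k} |(Z - Z^*)_{ij}| \le \max_{k \in [K]} \frac{1}{l_k} \|Z - Z^*\|_1 \le \frac{c_1 r K}{n}$$

which is equivalent to saying that the weights vector $w = (w_{kj} : k \in [K], j \in [n])$ is in the
$\ell_1^{Kn}$-ball $(c_1 r K/n) B_1^{Kn}$. It is also in the unit $\ell_\infty^{Kn}$-ball since for all $k \in [K]$ and $j \in [n]$,

$$|w_{kj}| \le \frac{1}{l_k} \sum_{i \in \mathcal{C}_k} |(Z - Z^*)_{ij}| \le \|Z - Z^*\|_\infty \le 1.$$

We therefore have $w \in B_\infty^{Kn} \cap (c_1 r K/n) B_1^{Kn}$ and so

$$\sup_{Z \in \mathcal{C} \cap (Z^* + r B_1^{n \times n})} \langle V, Z - Z^* \rangle \le \sup_{w \in B_\infty^{Kn} \cap (c_1 r K/n) B_1^{Kn}} \sum_{k \in [K], j \in [n]} X_{kj} w_{kj}. \tag{53}$$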
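
The reduction leading to (53) rests on two facts: the rows of $UU^\top \Psi$ are constant within each cluster, and $\langle V, Z - Z^* \rangle$ collapses to a weighted sum over cluster/column pairs. The following sketch checks both on synthetic data. It assumes that $U$ has columns $\mathbf{1}_{\mathcal{C}_k}/\sqrt{l_k}$ (so that $UU^\top$ averages over clusters); the cluster labels, the Gaussian stand-in for $\Psi$ and the uniform stand-in `D` for $Z - Z^*$ are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])     # hypothetical clusters C_1, C_2, C_3
n, K = len(labels), labels.max() + 1
sizes = np.bincount(labels)                         # l_k = |C_k|

# Assumed form of U: column k is the indicator of C_k divided by sqrt(l_k).
U = np.zeros((n, K))
U[np.arange(n), labels] = 1.0 / np.sqrt(sizes[labels])
P = U @ U.T                                         # averages over each cluster

Psi = np.triu(rng.standard_normal((n, n)), k=1)     # stand-in for the upper-triangular noise
V = P @ Psi                                         # V = U U^T Psi

D = rng.uniform(-1, 1, (n, n))                      # stand-in for Z - Z*
onehot = np.eye(K)[labels]                          # (n, K) membership indicators
X = onehot.T @ Psi                                  # X_{kj} = sum_{i in C_k} Psi_{ij}
w = (onehot.T @ D) / sizes[:, None]                 # w_{kj} = (1/l_k) sum_{i in C_k} D_{ij}

lhs = np.sum(V * D)                                 # <V, Z - Z*>
rhs = np.sum(X * w)                                 # sum_{k,j} X_{kj} w_{kj}
```

Here `lhs` and `rhs` agree up to rounding, and `w` satisfies the $\ell_\infty$ and $\ell_1$ bounds stated above with $\max_k 1/l_k$ in place of $c_1 K/n$.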
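It remains to find a high-probability upper bound on the latter term. We can use the
following lemma to that end.


Lemma 1 Let $X_{kj} = \sum_{i \in \mathcal{C}_k} \Psi_{ij}$ for $(k,j) \in [K] \times [n]$. For all $0 < R \le Kn$, if
$\lceil R \rceil \ge 2eKn \exp(-(9/32) n\rho/K)$ then with probability at least $1 - (\lceil R \rceil / (2eKn))^{\lceil R \rceil}$,

$$\sup_{w \in B_\infty^{Kn} \cap R B_1^{Kn}} \sum_{(k,j) \in [K] \times [n]} X_{kj} w_{kj} \le 4 \sqrt{8 c_0}\, R \sqrt{\frac{n\rho}{K} \log\left(\frac{2eKn}{\lceil R \rceil}\right)}.$$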

weBE" ARB" (k,j)eLK] x [n] [R] 








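Proof of Lemma 1. Let $N = Kn$ and assume that $1 \le R \le N$. We denote by $X_1^* \ge X_2^* \ge \cdots \ge X_N^*$ (resp. $w_1^* \ge \cdots \ge w_N^*$) the non-increasing rearrangement of $|X_{k,j}|$ (resp. $|w_{k,j}|$) for $(k,j) \in [K]\times[n]$. We have
\[
\begin{aligned}
\sup_{w\in B_\infty^N\cap RB_1^N} \sum_{(k,j)\in[K]\times[n]} X_{k,j}w_{k,j} &\le \sup_{w\in B_\infty^N\cap RB_1^N} \sum_{i=1}^N X_i^* w_i^* \\
&\le \sup_{w\in B_\infty^N} \sum_{i=1}^{\lceil R\rceil} X_i^* w_i^* + \sup_{w\in RB_1^N} \sum_{i=\lceil R\rceil+1}^{N} X_i^* w_i^* \le \sum_{i=1}^{\lceil R\rceil} X_i^* + R X_{\lceil R\rceil+1}^*.
\end{aligned}
\]
Since $R X_{\lceil R\rceil+1}^* \le \lceil R\rceil X_{\lceil R\rceil+1}^* \le \sum_{i=1}^{\lceil R\rceil} X_i^*$, the right-hand side is at most $2\sum_{i=1}^{\lceil R\rceil} X_i^*$.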
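The rearrangement step above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `sup_over_linf_l1` and `top_sum_bound` are hypothetical helpers introduced only for illustration. It computes the exact supremum of a linear form over $B_\infty^N \cap RB_1^N$ by a greedy (fractional knapsack) argument and compares it with twice the sum of the $\lceil R\rceil$ largest absolute values.

```python
import numpy as np

def sup_over_linf_l1(x, R):
    """Exact value of sup{<x, w> : ||w||_inf <= 1, ||w||_1 <= R} for R >= 0."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # non-increasing rearrangement
    R = min(float(R), a.size)
    m = int(np.floor(R))
    value = a[:m].sum()              # weight 1 on the m largest entries
    if m < a.size:
        value += (R - m) * a[m]      # remaining fractional weight on the next entry
    return float(value)

def top_sum_bound(x, R):
    """Upper bound 2 * (sum of the ceil(R) largest |x_i|), used in the proof for R >= 1."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]
    return float(2.0 * a[: int(np.ceil(R))].sum())

x = [3.0, -1.0, 2.0]
print(sup_over_linf_l1(x, 1.5), top_sum_bound(x, 1.5))
```

The exact supremum is never larger than the bound used in the proof, which is what makes it enough to control the sum of the $\lceil R\rceil$ largest $|X_{k,j}|$.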


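Moreover, for all $\tau > 0$, using a union bound, we have
\[
\begin{aligned}
\mathbb{P}\Big[\sum_{i=1}^{\lceil R\rceil} X_i^* > \tau\Big] &= \mathbb{P}\Big[\exists I \subset [K]\times[n] : |I| = \lceil R\rceil \text{ and } \sum_{(k,j)\in I} |X_{k,j}| > \tau\Big] \\
&= \mathbb{P}\Big[\max_{I\subset[K]\times[n] : |I|=\lceil R\rceil}\ \max_{u\in\{\pm1\}^{|I|}} \sum_{(k,j)\in I} u_{k,j}X_{k,j} > \tau\Big] \\
&\le \sum_{I\subset[K]\times[n] : |I|=\lceil R\rceil}\ \sum_{u\in\{\pm1\}^{|I|}} \mathbb{P}\Big[\sum_{(k,j)\in I} X_{k,j}u_{k,j} > \tau\Big] \\
&= \sum_{I\subset[K]\times[n] : |I|=\lceil R\rceil}\ \sum_{u\in\{\pm1\}^{|I|}} \mathbb{P}\Big[\sum_{(k,j)\in I}\sum_{i\in\mathcal{C}_k} \Psi_{ij}u_{k,j} > \tau\Big].
\end{aligned}
\]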


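Let us now control each term of the latter sum thanks to Bernstein's inequality. The random variables $(\Psi_{ij} : i,j \in [n])$ are independent with variances at most $\rho = \delta\max(1 - \delta(2p-1)^2, 1 - \delta(2q-1)^2)$ since $\mathrm{Var}(\Psi_{ij}) = 0$ when $i > j$ and $\mathrm{Var}(\Psi_{ij}) = \mathrm{Var}(A_{ij} - \mathbb{E}[A_{ij}]) = \mathrm{Var}(A_{ij}) \le \rho$ for $j > i$. Moreover, $|\Psi_{ij}| = 0$ when $j < i$ and $|\Psi_{ij}| = |W_{ij}| = |A_{ij} - \mathbb{E}A_{ij}| \le 2$ for $j > i$ because $A_{ij} \in \{-1, 0, 1\}$. It follows from Bernstein's inequality that, for all $I \subset [K]\times[n]$ satisfying $|I| = \lceil R\rceil$ and all $u \in \{\pm1\}^{|I|}$,
\[
\mathbb{P}\Big[\sum_{(k,j)\in I}\sum_{i\in\mathcal{C}_k} \Psi_{ij}u_{k,j} > \tau\Big] \le \exp\Big(\frac{-\tau^2}{2\lceil R\rceil c_0 n\rho/K + 4\tau/3}\Big) \le \exp\Big(\frac{-K\tau^2}{4c_0\lceil R\rceil n\rho}\Big)
\]
when $\tau \le (3/2)\lceil R\rceil c_0 n\rho/K$. Therefore, $\sup_{w\in B_\infty^N\cap RB_1^N} \sum X_{k,j}w_{k,j} \le 2\tau$ with probability at least
\[
1 - \Big(\frac{2eN}{\lceil R\rceil}\Big)^{\lceil R\rceil}\exp\Big(\frac{-K\tau^2}{4c_0\lceil R\rceil n\rho}\Big) \ge 1 - \exp\Big(\frac{-K\tau^2}{8c_0\lceil R\rceil n\rho}\Big)
\]
when
\[
(3/2)\lceil R\rceil c_0 n\rho/K \ge \tau \ge \sqrt{8c_0}\,\lceil R\rceil\sqrt{\frac{n\rho}{K}\log\Big(\frac{2eN}{\lceil R\rceil}\Big)},
\]
which is a non vacuous condition since $\lceil R\rceil \ge 2eN\exp(-(9/32)n\rho/K)$. The result follows, in the case $1 \le R \le N$, by taking $\tau = \sqrt{8c_0}\,\lceil R\rceil\sqrt{(n\rho/K)\log(2eN/\lceil R\rceil)}$ and using that $2R \ge \lceil R\rceil$ when $R \ge 1$.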
For $0 < R < 1$, we have
\[
\sup_{w\in B_\infty^{Kn}\cap RB_1^{Kn}} \sum_{(k,j)\in[K]\times[n]} X_{k,j}w_{k,j} = R\max_{(k,j)\in[K]\times[n]} |X_{k,j}|
\]
and, using Bernstein's inequality as above, we get that with probability at least $1 - \exp(-K\tau^2/(8c_0 n\rho))$, $\max_{(k,j)\in[K]\times[n]} |X_{k,j}| \le \tau$ when $3c_0 n\rho/(2K) \ge \tau \ge \sqrt{8c_0 n\rho\log(nK)/K}$, which is a non vacuous condition when $9c_0 n\rho \ge 4K\log(nK)$. By taking $\tau = \sqrt{8c_0 n\rho\log(nK)/K}$, we obtain that for all $0 < R < 1$, if $9c_0 n\rho \ge 4K\log(nK)$ then, with probability at least $1 - 1/(nK)$,
\[
\sup_{w\in B_\infty^{Kn}\cap RB_1^{Kn}} \sum_{(k,j)\in[K]\times[n]} X_{k,j}w_{k,j} \le R\sqrt{\frac{8c_0 n\rho\log(nK)}{K}}.
\]


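We apply Lemma 1 for $R = c_1 r K/n$ to control (53):
\[
\mathbb{P}\Big[\sup_{Z\in\mathcal{C}\cap(Z^*+rB_1^{n\times n})} \langle V, Z - Z^*\rangle \le 4c_1\sqrt{8c_0}\, r\sqrt{\frac{K\rho}{n}\log\Big(\frac{2eKn}{\lceil c_1 r K/n\rceil}\Big)}\Big] \ge 1 - \Big(\frac{\lceil c_1 r K/n\rceil}{2eKn}\Big)^{\lceil c_1 r K/n\rceil}
\]
when $\lceil c_1 r K/n\rceil \ge 2eKn\exp(-(9/32)n\rho/K)$.
Using the same methodology, we can prove exactly the same result for the quantity
\[
\sup_{Z\in\mathcal{C}\cap(Z^*+rB_1^{n\times n})} \langle UU^\top \Psi^\top, Z - Z^*\rangle.
\]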

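We can also use the same method to upper bound $\sup_{Z\in\mathcal{C}\cap(Z^*+rB_1^{n\times n})} \langle UU^\top \Psi, (Z - Z^*)UU^\top\rangle$. We simply have to check that the weights vector $w' = (w'_{k,j} : k \in [K], j \in [n])$, where $w'_{k,j} = (1/l_k)\sum_{i\in\mathcal{C}_k} [(Z - Z^*)UU^\top]_{ij}$, is also in $B_\infty^{Kn} \cap (c_1 r K/n)B_1^{Kn}$ for any $Z \in \mathcal{C}$ such that $\|Z - Z^*\|_1 \le r$. This is indeed the case, since we have for all $i \in [n]$, $k' \in [K]$ and $j \in \mathcal{C}_{k'}$, $[(Z - Z^*)UU^\top]_{ij} = \sum_{p=1}^n (Z - Z^*)_{ip}(UU^\top)_{pj} = \sum_{p\in\mathcal{C}_{k'}} (Z - Z^*)_{ip}/l_{k'}$, which is therefore constant over all $j \in \mathcal{C}_{k'}$. Therefore, we have
\[
\begin{aligned}
\sum_{j\in[n]}\sum_{k\in[K]} |w'_{k,j}| &= \sum_{k\in[K]}\sum_{k'\in[K]}\sum_{j\in\mathcal{C}_{k'}} \Big|\frac{1}{l_k l_{k'}}\sum_{i\in\mathcal{C}_k}\sum_{p\in\mathcal{C}_{k'}} (Z - Z^*)_{ip}\Big| \\
&\le \sum_{k\in[K]}\sum_{k'\in[K]} \frac{1}{l_k}\sum_{i\in\mathcal{C}_k}\sum_{p\in\mathcal{C}_{k'}} |(Z - Z^*)_{ip}| \le \frac{c_1 K}{n}\|Z - Z^*\|_1 \le \frac{c_1 r K}{n}
\end{aligned}
\]
and, for all $k, k' \in [K]$ and $j \in \mathcal{C}_{k'}$,
\[
|w'_{k,j}| = \Big|\frac{1}{l_k l_{k'}}\sum_{i\in\mathcal{C}_k}\sum_{p\in\mathcal{C}_{k'}} (Z - Z^*)_{ip}\Big| \le \|Z - Z^*\|_\infty \le 1.
\]
Therefore, $w' \in B_\infty^{Kn} \cap (c_1 r K/n)B_1^{Kn}$ and we obtain exactly the same upper bound for the three terms in (52). This concludes the proof of Proposition 2.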


Proof of Proposition 3: control of the S2(Z) term from Fei and Chen (2019b) 


In this section, we prove Proposition 3. We follow the proof from Fei and Chen (2019b) but we only consider the "dense case", which is when $n\delta\nu \ge \log n$ (we recall that $\nu = \max(2p - 1, 1 - 2q)$). For a similar uniform control of $S_2(Z)$ in the "sparse case", when $c_0 \le n\delta\nu \le \log n$ for some constant $c_0$, we refer the reader to Fei and Chen (2019b).

For all $Z \in \mathcal{C}$, we have $S_2(Z) = \langle \mathcal{P}^\perp(Z - Z^*), W\rangle = \langle \mathcal{P}^\perp(Z), W\rangle$ because, by construction of the projection operator, $\mathcal{P}^\perp(Z^*) = 0$. Therefore, $S_2(Z) \le \|\mathcal{P}^\perp(Z)\|_* \|W\|_{op}$ where $\|\cdot\|_*$ denotes the nuclear norm (i.e. the sum of the singular values) and $\|\cdot\|_{op}$ denotes the operator norm (i.e. the largest singular value). In the following Lemma 2, we prove an upper bound on $\|\mathcal{P}^\perp(Z)\|_*$ and then we obtain a high-probability upper bound on $\|W\|_{op}$.


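Lemma 2 For all $Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})$, we have
\[
\|\mathcal{P}^\perp(Z)\|_* = \mathrm{Tr}(\mathcal{P}^\perp(Z)) \le \frac{c_1 K}{n}\|Z - Z^*\|_1 \le \frac{c_1 K r}{n}.
\]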


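Proof Since $Z \succeq 0$, so is $(I_n - UU^\top)Z(I_n - UU^\top)$, and so $\mathcal{P}^\perp(Z) = (I_n - UU^\top)Z(I_n - UU^\top) \succeq 0$; therefore $\|\mathcal{P}^\perp(Z)\|_* = \mathrm{Tr}(\mathcal{P}^\perp(Z))$. Next, we bound the trace of $\mathcal{P}^\perp(Z)$.

Since $I_n - UU^\top$ is a projection operator, it is symmetric and $(I_n - UU^\top)^2 = I_n - UU^\top$; moreover, $\mathrm{Tr}(Z) = n = \mathrm{Tr}(Z^*)$ when $Z \in \mathcal{C}$, so
\[
\begin{aligned}
\mathrm{Tr}(\mathcal{P}^\perp(Z)) &= \mathrm{Tr}(\mathcal{P}^\perp(Z - Z^*)) = \mathrm{Tr}\big((I_n - UU^\top)(Z - Z^*)(I_n - UU^\top)\big) \\
&= \mathrm{Tr}\big((I_n - UU^\top)^2(Z - Z^*)\big) = \mathrm{Tr}\big((I_n - UU^\top)(Z - Z^*)\big) \\
&= \mathrm{Tr}(Z) - \mathrm{Tr}(Z^*) + \mathrm{Tr}\big(UU^\top(Z^* - Z)\big) = \sum_{i,j} (UU^\top)_{ij}(Z^* - Z)_{ij} \\
&= \sum_{k\in[K]} \frac{1}{l_k}\sum_{i,j\in\mathcal{C}_k} (Z^* - Z)_{ij} \overset{(i)}{=} \sum_{k\in[K]} \frac{1}{l_k}\sum_{i,j\in\mathcal{C}_k} |(Z^* - Z)_{ij}| \\
&\le \frac{c_1 K}{n}\sum_{k\in[K]}\sum_{i,j\in\mathcal{C}_k} |(Z^* - Z)_{ij}| \le \frac{c_1 K}{n}\|Z - Z^*\|_1,
\end{aligned}
\]
where we used in (i) that, for $i$ and $j$ in a same community, we have $Z^*_{ij} = 1$ and $Z_{ij} \in [0,1]$, thus $(Z^* - Z)_{ij} = |(Z^* - Z)_{ij}|$. Finally, when $Z$ is in the localized set $\mathcal{C} \cap (Z^* + rB_1^{n\times n})$, we have $\|Z - Z^*\|_1 \le r$, which concludes the proof. ∎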
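The trace identity in the proof of Lemma 2 can be verified on a small example. The following is a minimal sketch, assuming NumPy and assuming that $U$ has as columns the community indicators scaled by $1/\sqrt{l_k}$ (so that $UU^\top$ has entries $1/l_k$ within $\mathcal{C}_k$, as used above); the helper names and the test matrix $Z$ are illustrative choices, not taken from the paper.

```python
import numpy as np

def membership(labels):
    """Membership matrix: entry (i, j) is 1 when i and j share a label, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def normalized_indicator_basis(labels):
    """Columns are community indicators scaled by 1/sqrt(l_k)."""
    labels = np.asarray(labels)
    communities = np.unique(labels)
    U = np.zeros((labels.size, communities.size))
    for col, k in enumerate(communities):
        idx = labels == k
        U[idx, col] = 1.0 / np.sqrt(idx.sum())
    return U

def trace_of_projection(Z, U):
    """Tr((I - U U^T) Z (I - U U^T))."""
    Q = np.eye(Z.shape[0]) - U @ U.T
    return float(np.trace(Q @ Z @ Q))

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
other = np.array([0, 0, 1, 1, 1, 2, 2, 0])
Z_star = membership(labels)
# A feasible point: PSD, unit diagonal, entries in [0, 1].
Z = 0.7 * Z_star + 0.3 * membership(other)
U = normalized_indicator_basis(labels)

lhs = trace_of_projection(Z, U)
# Crude version of the last bound: 1/l_k <= 1/min_k l_k.
rhs = np.abs(Z - Z_star).sum() / np.bincount(labels).min()
print(lhs, rhs)
```

Here the factor $1/\min_k l_k$ plays the role of $c_1 K/n$ in the lemma.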
Now, we obtain a high-probability upper bound on $\|W\|_{op}$. In the following, we apply this result in the "dense case" (i.e. $n\delta\nu \ge \log n$) to get the uniform bound on $S_2(Z)$ over $Z \in \mathcal{C} \cap (Z^* + rB_1^{n\times n})$.


Lemma 3 (Lemma 4 in Fei and Chen (2019b)) With probability at least $1 - \exp(-\delta\nu n)$,
\[
\|W\|_{op} \le 16\sqrt{\delta\nu n} + 168\sqrt{\log n}.
\]


Proof Let $A'$ be an independent copy of $A$ and $R \in \mathbb{R}^{n\times n}$ be a random symmetric matrix, independent from both $A$ and $A'$, whose sub-diagonal entries are independent Rademacher random variables. Using a symmetrization argument (see Chapter 2 in Koltchinskii (2011) or Chapter 2.3 in van der Vaart and Wellner (1996)), we obtain for $W = A - \mathbb{E}A$,
\[
\mathbb{E}\|W\|_{op} = \mathbb{E}\|A - \mathbb{E}A'\|_{op} \overset{(i)}{\le} \mathbb{E}\|A - A'\|_{op} \overset{(ii)}{=} \mathbb{E}\|(A - A')\circ R\|_{op} \overset{(iii)}{\le} 2\,\mathbb{E}\|A\circ R\|_{op}
\]
where $\circ$ is the entry-wise matrix multiplication operation, (i) comes from Jensen's inequality, (ii) occurs since $A - A'$ and $(A - A')\circ R$ are identically distributed and (iii) is the triangle inequality. Next, we obtain an upper bound on $\mathbb{E}\|A\circ R\|_{op}$.

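We define the family of independent random variables $(\xi_{ij} : 1 \le i \le j \le n)$ where, for all $1 \le i \le j \le n$,
\[
\xi_{ij} = \begin{cases} \dfrac{1}{\sqrt{|\mathbb{E}A_{ij}|}} & \text{with probability } \dfrac{|\mathbb{E}A_{ij}|}{2}, \\[2mm] -\dfrac{1}{\sqrt{|\mathbb{E}A_{ij}|}} & \text{with probability } \dfrac{|\mathbb{E}A_{ij}|}{2}, \\[2mm] 0 & \text{with probability } 1 - |\mathbb{E}A_{ij}|. \end{cases} \qquad (54)
\]
We also put $b_{ij} := \sqrt{|\mathbb{E}A_{ij}|}$ for all $1 \le i \le j \le n$. It is straightforward to see that $(\xi_{ij}b_{ij} : 1 \le i \le j \le n)$ and $(A_{ij}R_{ij} : 1 \le i \le j \le n)$ have the same distribution. As a consequence, $\|A\circ R\|_{op}$ and $\|X\|_{op}$ have the same distribution, where $X \in \mathbb{R}^{n\times n}$ is a symmetric matrix with $X_{ij} = \xi_{ij}b_{ij}$ for $1 \le i \le j \le n$. An upper bound on $\mathbb{E}\|X\|_{op}$ follows from the next result due to Bandeira and Van Handel (2016).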

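Theorem 9 (Corollary 3.6 in Bandeira and Van Handel (2016)) Let $\xi_{ij}$, $1 \le i \le j \le n$, be independent symmetric random variables with unit variance and $(b_{ij}, 1 \le i \le j \le n)$ be a family of real numbers. Let $X \in \mathbb{R}^{n\times n}$ be the random symmetric matrix defined by $X_{ij} = \xi_{ij}b_{ij}$ for all $1 \le i \le j \le n$. Let $\sigma_\infty := \max_{1\le i\le n} \sqrt{\sum_{j=1}^n b_{ij}^2}$. Then, for any $\alpha \ge 3$,
\[
\mathbb{E}\|X\|_{op} \le e^{2/\alpha}\Big[2\sigma_\infty + 14\alpha \max_{1\le i\le j\le n} \|\xi_{ij}b_{ij}\|_{2\lceil\alpha\log n\rceil}\sqrt{\log n}\Big]
\]
where, for any $q > 0$, $\|\cdot\|_q$ denotes the $L_q$-norm.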

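Since $(\xi_{ij} : 1 \le i \le j \le n)$ are independent symmetric random variables such that $\mathrm{Var}(\xi_{ij}) = \mathbb{E}[\xi_{ij}^2] = 1$, we can apply Theorem 9 to upper bound $\mathbb{E}\|X\|_{op} = \mathbb{E}\|A\circ R\|_{op}$. We have $\|\xi_{ij}b_{ij}\|_{2\lceil\alpha\log n\rceil} \le 1$ for any $\alpha \ge 3$ and $b_{ij}^2 = |\mathbb{E}A_{ij}| \le \delta\nu$. It therefore follows from Theorem 9 for $\alpha = 3$ that
\[
\mathbb{E}\|W\|_{op} \le 2e^{2/3}\Big[2\sqrt{n\delta\nu} + 42\sqrt{\log n}\Big] \le 8\sqrt{n\delta\nu} + 168\sqrt{\log n}. \qquad (55)
\]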
The final step to prove Lemma 3 is a concentration argument showing that $\|W\|_{op}$ is close to its expectation with high probability. To that end we rely on a general result for Lipschitz and separately convex functions from Boucheron et al. (2013). We first recall that a real-valued function $f$ of $N$ variables is said to be separately convex when, for every $i = 1, \ldots, N$, it is a convex function of the $i$-th variable if the rest of the variables are fixed.


Theorem 10 (Theorem 6.10 in Boucheron et al. (2013)) Let $\mathcal{X}$ be a convex compact set in $\mathbb{R}$ with diameter $B$. Let $X_1, \ldots, X_N$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^N \to \mathbb{R}$ be a separately convex and 1-Lipschitz function w.r.t. the $\ell_2^N$-norm (i.e. $|f(x) - f(y)| \le \|x - y\|_2$ for all $x, y \in \mathcal{X}^N$). Then $Z = f(X_1, \ldots, X_N)$ satisfies, for all $t > 0$, with probability at least $1 - \exp(-t^2/B^2)$, $Z \le \mathbb{E}[Z] + t$.

We apply Theorem 10 to
\[
Z := \frac{1}{\sqrt{2}}\|W\|_{op} = f(A_{ij} - \mathbb{E}A_{ij},\ 1 \le i < j \le n) = \frac{1}{\sqrt{2}}\|A - \mathbb{E}A\|_{op}
\]
where $f$ is a separately convex function which is 1-Lipschitz w.r.t. the $\ell_2^N$-norm for $N = n(n-1)/2$, and $(A_{ij} - \mathbb{E}A_{ij}, 1 \le i < j \le n)$ is a family of $N$ independent random variables. Moreover, for each $i > j$, $(A - \mathbb{E}A)_{ij} \in [-1 - \delta(2p-1), 1 - \delta(2q-1)]$, which is a convex compact set with diameter $B = 2(1 + \delta(p - q)) \le 4$. Therefore, it follows from Theorem 10 that for all $t > 0$, with probability at least $1 - \exp(-t^2/16)$, $\|W\|_{op} \le \mathbb{E}\|W\|_{op} + \sqrt{2}\,t$. In particular, we finish the proof of Lemma 3 for $t = 4\sqrt{\delta\nu n}$ and using the bound from (55). ∎

It follows from Lemma 3 that when $n\nu\delta \ge \log n$, $\|W\|_{op} \le 184\sqrt{n\delta\nu}$ with probability at least $1 - \exp(-\delta\nu n)$. Using this latter result together with Lemma 2 concludes the proof of Proposition 3.
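The bound of Lemma 3 can be compared with a simulation. The sketch below assumes NumPy and assumes the following sampling scheme for the signed stochastic block model, chosen to be consistent with the means $\delta(2p-1)$ and $\delta(2q-1)$ used above: each pair is observed with probability $\delta$, and an observed edge has sign $+1$ with probability $p$ within a community and $q$ across communities. The function names and parameter values are hypothetical.

```python
import numpy as np

def sample_signed_sbm(labels, delta, p, q, rng):
    """Sample a symmetric A with entries in {-1, 0, 1} and zero diagonal, together with E[A]."""
    labels = np.asarray(labels)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    prob_plus = np.where(same, p, q)
    observed = rng.random((n, n)) < delta
    signs = np.where(rng.random((n, n)) < prob_plus, 1.0, -1.0)
    upper = np.triu(observed * signs, k=1)   # keep entries with i < j only
    A = upper + upper.T
    mean = delta * (2.0 * prob_plus - 1.0)
    np.fill_diagonal(mean, 0.0)
    return A, mean

def lemma3_bound(n, delta, nu):
    """Right-hand side of Lemma 3: 16 sqrt(delta nu n) + 168 sqrt(log n)."""
    return float(16.0 * np.sqrt(delta * nu * n) + 168.0 * np.sqrt(np.log(n)))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
delta, p, q = 0.5, 0.8, 0.2
nu = max(2 * p - 1, 1 - 2 * q)
A, mean = sample_signed_sbm(labels, delta, p, q, rng)
op_norm = float(np.linalg.norm(A - mean, 2))   # largest singular value of W = A - E[A]
print(op_norm, lemma3_bound(labels.size, delta, nu))
```

On such a draw the operator norm is expected to sit well below the bound, since the constants 16 and 168 are not optimized.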

Appendix C. Proof of Theorem 6 for Angular Synchronization with additive noise


Here we consider the following model: we observe $C = Z^* + \sigma W$, where $Z^* = x^*(\bar{x}^*)^\top$, $x^* = (e^{\iota\delta_i})_{i=1}^n$, and $W \in \mathbb{C}^{n\times n}$ is a complex Wigner matrix (i.e. $W = \bar{W}^\top$, its above-diagonal entries are complex numbers whose real and imaginary parts are independent normally distributed random variables with mean zero and variance 1/2, and its diagonal entries are zero).


Let us first show that $Z^*$ is the oracle in our approach. That is to show that
\[
Z^* \in \operatorname*{argmax}_{Z\in\mathcal{C}} \langle \mathbb{E}C, Z\rangle \qquad (56)
\]
where $\mathcal{C} := \{Z \in \mathbb{H}_n : Z \succeq 0, \mathrm{diag}(Z) = 1_n\}$.
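We recall that the offsets are $\delta_{ij} = \delta_i - \delta_j\ [2\pi]$ for $i,j = 1, \ldots, n$. Let $\gamma_1, \ldots, \gamma_n \in [0, 2\pi)$ and define $x_i = e^{\iota\gamma_i}$, $i = 1, \ldots, n$. We have for all $i,j = 1, \ldots, n$,
\[
\gamma_i - \gamma_j = \delta_{ij}\ [2\pi] \iff e^{\iota\gamma_i} - e^{\iota\delta_{ij}}e^{\iota\gamma_j} = 0 \iff e^{\iota\delta_{ij}}x_j - x_i = 0 \iff \mathbb{E}C_{ij}x_j - x_i = 0.
\]
It follows that
\[
\operatorname*{argmin}_{x\in\mathbb{C}^n : |x_i| = 1\ \forall i} \Big\{\sum_{i,j=1}^n |\mathbb{E}C_{ij}x_j - x_i|^2\Big\} = \big\{(e^{\iota(\delta_i + \theta_0)})_{i=1}^n : \theta_0 \in [0, 2\pi)\big\}. \qquad (57)
\]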
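Let us now rewrite the latter optimization problem as an SDP problem. Let $x \in \mathbb{C}^n$ be such that $|x_i| = 1$ for all $i = 1, \ldots, n$. We have
\[
\begin{aligned}
\sum_{i,j=1}^n |\mathbb{E}C_{ij}x_j - x_i|^2 &= \sum_{i,j=1}^n (\mathbb{E}C_{ij}x_j - x_i)\overline{(\mathbb{E}C_{ij}x_j - x_i)} \\
&= \sum_{i,j=1}^n \Big[|\mathbb{E}C_{ij}x_j|^2 + |x_i|^2 - \big(\mathbb{E}C_{ij}x_j\bar{x}_i + \overline{\mathbb{E}C_{ij}x_j}\,x_i\big)\Big] = 2n^2 - 2\Re\Big(\sum_{i,j=1}^n \mathbb{E}C_{ij}x_j\bar{x}_i\Big) \\
&= 2n^2 - 2\Re\big(\bar{x}^\top \mathbb{E}C\, x\big) = 2n^2 - 2\,\bar{x}^\top \mathbb{E}C\, x,
\end{aligned}
\]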

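where we used in the last equality that $\mathbb{E}[C] = Z^* = x^*(\bar{x}^*)^\top$, so that $\bar{x}^\top \mathbb{E}C\, x = |\langle x, x^*\rangle|^2 \in \mathbb{R}$. Next, we see that $\bar{x}^\top \mathbb{E}C\, x = \langle \mathbb{E}C, X\rangle$ where $X = x\bar{x}^\top$; hence, minimizing $x \in \mathbb{C}^n \mapsto \sum_{i,j} |\mathbb{E}[C]_{ij}x_j - x_i|^2$ over all $x \in \mathbb{C}^n$ such that $|x_i| = 1$ boils down to maximizing $X \in \mathbb{C}^{n\times n} \mapsto \mathrm{tr}(\mathbb{E}[C]X) \in \mathbb{R}$ over the set $\{X = x\bar{x}^\top : x \in \mathbb{C}^n, |x_i| = 1\ \forall i\} = \{X \in \mathbb{H}_n : X \succeq 0, \mathrm{diag}(X) = 1_n, \mathrm{rank}(X) = 1\}$, where $\mathbb{H}_n$ is the set of Hermitian matrices of size $n \times n$. Then, it follows from (57) that
\[
\operatorname*{argmax}_{X\in\mathbb{H}_n : X\succeq 0,\ \mathrm{diag}(X) = 1_n,\ \mathrm{rank}(X) = 1} \langle \mathbb{E}C, X\rangle = \{Z^*\}
\]
because, for all $\theta_0 \in [0, 2\pi)$, $(e^{\iota(\delta_i + \theta_0)})_{i=1}^n\, \overline{(e^{\iota(\delta_i + \theta_0)})_{i=1}^n}^{\,\top} = Z^*$. The latter equality is almost the result that we want to prove in (56); it only remains to show that the rank-one constraint may be dropped. We use the same approach as the one we used to show (18).

First, one can easily check that the following inclusion holds true:
\[
\mathcal{C} \subset \mathcal{C}' := \{Z \in \mathbb{C}^{n\times n} : |Z_{ij}| \le 1,\ \forall i,j \in [n]\}.
\]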

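Second, the objective function $Z \mapsto \Re(\langle \mathbb{E}C, Z\rangle)$ is linear (w.r.t. real coefficients); hence, maximizing it over the convex set $\mathcal{C}'$ yields a solution at an extreme point of $\mathcal{C}'$, that is in the set of matrices $Z \in \mathbb{C}^{n\times n}$ such that $|Z_{ij}| = 1$ for all $(i,j)$. Let $X = (e^{\iota\beta_{ij}})_{1\le i,j\le n}$ with $0 \le \beta_{ij} < 2\pi$ be an extreme point of $\mathcal{C}'$. We have
\[
\Re(\langle \mathbb{E}C, X\rangle) = \Re\Big(\sum_{i,j=1}^n Z^*_{ij}\bar{X}_{ij}\Big) = \Re\Big(\sum_{i,j=1}^n e^{\iota\delta_{ij}}e^{-\iota\beta_{ij}}\Big) = \sum_{i,j=1}^n \cos(\delta_{ij} - \beta_{ij}) \le n^2
\]
and the maximum is achieved when $\beta_{ij} = \delta_{ij}\ [2\pi]$ for all $(i,j)$, that is when $X = Z^*$. Since $Z^* \in \mathcal{C} \subset \mathcal{C}'$, $Z^*$ is the unique maximizer of $Z \mapsto \langle \mathbb{E}C, Z\rangle$ over $\mathcal{C}$. Therefore (56) holds and so $Z^*$ is also the oracle in this model.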

Curvature of the objective function 


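Let $Z \in \mathcal{C}$. We can write $Z = (z_{ij}e^{\iota\beta_{ij}})_{1\le i,j\le n}$ with $0 \le z_{ij} \le 1$ (we recall that $\mathcal{C} \subset \mathcal{C}'$ defined above) and $\beta_{ij} \in [0, 2\pi)$. On one side, we have
\[
\begin{aligned}
\langle \mathbb{E}C, Z^* - Z\rangle &= \Re(\langle \mathbb{E}C, Z^* - Z\rangle) = \Re\Big(\sum_{i,j=1}^n \mathbb{E}C_{ij}\overline{(Z^* - Z)_{ij}}\Big) \\
&= \Re\Big(\sum_{i,j=1}^n e^{\iota\delta_{ij}}\big(e^{-\iota\delta_{ij}} - z_{ij}e^{-\iota\beta_{ij}}\big)\Big) = \Re\Big(\sum_{i,j=1}^n \big(1 - z_{ij}e^{\iota(\delta_{ij} - \beta_{ij})}\big)\Big) = \sum_{i,j=1}^n \big(1 - z_{ij}\cos(\delta_{ij} - \beta_{ij})\big).
\end{aligned}
\]
On the other side, we have
\[
\begin{aligned}
\|Z^* - Z\|_2^2 &= \sum_{i,j=1}^n |(Z^* - Z)_{ij}|^2 = \sum_{i,j=1}^n |e^{\iota\delta_{ij}} - z_{ij}e^{\iota\beta_{ij}}|^2 = \sum_{i,j=1}^n |1 - z_{ij}e^{\iota(\beta_{ij} - \delta_{ij})}|^2 \\
&= \sum_{i,j=1}^n \Big[\big(1 - z_{ij}\cos(\beta_{ij} - \delta_{ij})\big)^2 + z_{ij}^2\sin^2(\beta_{ij} - \delta_{ij})\Big] = \sum_{i,j=1}^n \big(1 - 2z_{ij}\cos(\beta_{ij} - \delta_{ij}) + z_{ij}^2\big) \\
&\le 2\sum_{i,j=1}^n \big(1 - z_{ij}\cos(\beta_{ij} - \delta_{ij})\big) = 2\langle \mathbb{E}C, Z^* - Z\rangle
\end{aligned}
\]
where we used in the last inequality that $0 \le z_{ij} \le 1$.
We conclude that the excess risk satisfies the following curvature: for all $Z \in \mathcal{C}$,
\[
\langle \mathbb{E}C, Z^* - Z\rangle \ge \frac{1}{2}\|Z^* - Z\|_2^2. \qquad (58)
\]
That is, Assumption 1 holds with $G : M \in \mathbb{C}^{n\times n} \mapsto (1/2)\|M\|_2^2$.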
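The curvature inequality (58) is easy to test numerically. The sketch below assumes NumPy and takes the inner product to be $\langle A, B\rangle = \Re\,\mathrm{tr}(A\bar{B}^\top)$, which is the convention suggested by the computations above; the helper names `inner` and `rank_one` are illustrative. For a rank-one feasible point the two sides coincide, and for a convex combination of such points the inequality is strict in general.

```python
import numpy as np

def inner(A, B):
    """Real inner product <A, B> = Re tr(A B^H) on complex matrices."""
    return float(np.real(np.vdot(B, A)))   # vdot conjugates its first argument

def rank_one(x):
    """The Hermitian rank-one matrix x x^H."""
    return np.outer(x, np.conj(x))

rng = np.random.default_rng(0)
n = 6
x_star = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
Z_star = rank_one(x_star)            # E[C] = Z* in the additive-noise model

x = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
y = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
Z_rank1 = rank_one(x)                                # feasible, rank one
Z_mixed = 0.5 * rank_one(x) + 0.5 * rank_one(y)      # feasible: PSD, unit diagonal

def excess_risk(Z):
    return inner(Z_star, Z_star - Z)

def half_squared_distance(Z):
    return 0.5 * float(np.linalg.norm(Z_star - Z, "fro") ** 2)

print(excess_risk(Z_rank1), half_squared_distance(Z_rank1))
print(excess_risk(Z_mixed), half_squared_distance(Z_mixed))
```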


Three upper bounds on the fixed point $r^*_G(\Delta)$ in the angular group synchronization model with additive noise


According to our methodology (associated with Corollary 1), we need to calculate the following fixed point: let $\Delta \in (0,1)$, and consider
\[
\begin{aligned}
r^*_G(\Delta) &= \inf\Big\{r > 0 : \mathbb{P}\Big[\sup_{Z\in\mathcal{C} : G(Z^* - Z)\le r} \langle C - \mathbb{E}C, Z - Z^*\rangle \le r/2\Big] \ge 1 - \Delta\Big\} \\
&= \inf\Big\{r > 0 : \mathbb{P}\Big[\sup_{Z\in\mathcal{C}_r} \langle \sigma W, Z - Z^*\rangle \le r/2\Big] \ge 1 - \Delta\Big\}
\end{aligned}
\]
where $\mathcal{C}_r := \{Z \in \mathcal{C} : \|Z - Z^*\|_2 \le \sqrt{2r}\}$.
For pedagogical purposes, we show how to perform this computation in three different ways, yielding three different results. We obtain the following upper bounds:
\[
r^*_G(\exp(-n^2/2)) \le 32\sigma^2 n^2, \quad r^*_G(\exp(-n/2)) \le 36K_G^{\mathbb{C}}\sigma n^{3/2} \quad \text{and} \quad r^*_G(5\exp(-n/2)) \le 2(20\sigma)^2 n.
\]
Each of the three bounds follows from a different strategy. The first one is based on the inclusion $\mathcal{C}_r \subset Z^* + \sqrt{2r}B_2^{n\times n}$, the second one on $\mathcal{C}_r \subset \mathcal{C}$ and is therefore the approach that we called "global", and the last one follows from the strategy used in Fei and Chen (2019b) that we already used for the signed clustering problem.
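FIRST UPPER BOUND ON THE FIXED POINT $r^*_G$ USING $\mathcal{C}_r \subset Z^* + \sqrt{2r}B_2^{n\times n}$
In this section, we use the following inclusion to obtain the result:
\[
\mathcal{C}_r \subset Z^* + \sqrt{2r}B_2^{n\times n}, \qquad (59)
\]
where $B_2^{n\times n}$ is the Euclidean ball in $\mathbb{R}^{n\times n}$. We have
\[
\sup_{Z\in\mathcal{C} : \|Z - Z^*\|_2 \le \sqrt{2r}} \langle \sigma W, Z - Z^*\rangle \le \sigma\sup_{Z\in\sqrt{2r}B_2^{n\times n}} \langle W, Z\rangle = \sigma\sqrt{2r}\,\|W\|_2.
\]
Next, we use Borell's concentration inequality for Gaussian processes Ledoux (2001): for all $u > 0$, with probability at least $1 - \exp(-u^2/2)$, we have that
\[
\|W\|_2 \le \mathbb{E}\|W\|_2 + u\sup_{Z\in B_2^{n\times n}} \sqrt{\mathbb{E}\langle W, Z\rangle^2} \le n + u.
\]
As a consequence, taking $u = n$, for $\Delta = \exp(-n^2/2)$ and $r = 32\sigma^2 n^2$, we have, with probability at least $1 - \Delta$,
\[
\sup_{Z\in\mathcal{C} : \|Z - Z^*\|_2 \le \sqrt{2r}} \langle \sigma W, Z - Z^*\rangle \le 2n\sigma\sqrt{2r} \le r/2.
\]
Hence, one has $r^*_G(\Delta) \le 32\sigma^2 n^2$.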
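We apply Corollary 1 to get that, with probability at least $1 - \exp(-n^2/2)$,
\[
\frac{1}{2}\|\hat{Z} - Z^*\|_2^2 \le \langle \mathbb{E}C, Z^* - \hat{Z}\rangle \le r^*_G(\exp(-n^2/2)) \le 32\sigma^2 n^2. \qquad (60)
\]
Next, we know that the oracle $Z^*$ is the rank-one matrix $x^*(\bar{x}^*)^\top$, which has $n$ as its largest eigenvalue and associated eigenspace $\{\lambda x^* : \lambda \in \mathbb{C}\}$. In particular, $Z^*$ has a spectral gap $g = n$. Let $\hat{x} \in \mathbb{C}^n$ be a top eigenvector of $\hat{Z}$ with norm $\|\hat{x}\|_2 = \sqrt{n}$. It follows from the Davis-Kahan Theorem (see, for example, Theorem 4.5.5 in Vershynin (2018) or Theorem 4 in Vu (2010)) that there exists a universal constant $c_0 > 0$ such that
\[
\min_{z\in\mathbb{C} : |z| = 1} \Big\|\frac{\hat{x}}{\sqrt{n}} - z\frac{x^*}{\sqrt{n}}\Big\|_2 \le \frac{c_0}{g}\|\hat{Z} - Z^*\|_2
\]
where $g = n$ is the spectral gap of $Z^*$. Using (60), we conclude that, with probability at least $1 - \exp(-n^2/2)$, it holds true that
\[
\min_{z\in\mathbb{C} : |z| = 1} \|\hat{x} - zx^*\|_2 \le 8c_0\sigma\sqrt{n}.
\]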
zeC:|z|=1 


SECOND UPPER BOUND ON THE FIXED POINT $r^*_G$: THE GLOBAL APPROACH


One may wonder what type of result we can get for the angular group synchronization problem with additive noise using the global approach. It is the aim of this section to answer this question.

As for the community detection problem, we will use Grothendieck's inequality for the global approach. As in (59), the global approach also uses an inclusion of the localized set $\mathcal{C}_r$ but, unlike (59), we just drop the localization: we simply use $\mathcal{C}_r \subset \mathcal{C}$. We have that
\[
\sup_{Z\in\mathcal{C} : \|Z - Z^*\|_2 \le \sqrt{2r}} \langle \sigma W, Z - Z^*\rangle \le \sup_{Z\in\mathcal{C}} \langle \sigma W, Z - Z^*\rangle \le 2K_G^{\mathbb{C}}\sigma\|W\|_{cut},
\]
where we used Grothendieck's inequality as in (42) but in the complex case ($K_G^{\mathbb{C}}$ denoting Grothendieck's constant in the complex case). Here the cut norm in the complex case is defined as
\[
\|W\|_{cut} = \sup_{s_i, t_j\in\mathbb{C} : |s_i| = |t_j| = 1} \Big|\sum_{i,j} W_{ij}s_i\bar{t}_j\Big|.
\]
We therefore end up with the computation of the cut norm of the noise, as in Guédon and Vershynin (2016) for the community detection problem.

Here the noise being Gaussian, the cut norm $\|W\|_{cut}$ is the supremum of a Gaussian process for which we can use Borell's concentration inequality (see Ledoux (2001)) to get, for all $u > 0$, with probability at least $1 - \exp(-u^2/2)$,
\[
\|W\|_{cut} \le \mathbb{E}\|W\|_{cut} + u\sup_{s_i, t_j\in\mathbb{C} : |s_i| = |t_j| = 1} \sqrt{\mathbb{E}\Big|\sum_{i,j} W_{ij}s_i\bar{t}_j\Big|^2} \le \mathbb{E}\|W\|_{cut} + un.
\]
Next, we use Slepian's lemma (see Chapter 3 from Ledoux and Talagrand (1991)) to handle the complexity term $\mathbb{E}\|W\|_{cut}$. First, we need to upper bound the canonical metric associated with the Gaussian process: for every $s_i, s_i', t_j, t_j' \in \mathbb{C}$ with $|s_i| = |t_j| = |s_i'| = |t_j'| = 1$, we have
\[
\begin{aligned}
\mathbb{E}\Big|\sum_{i,j} W_{ij}s_i\bar{t}_j - \sum_{i,j} W_{ij}s_i'\bar{t}_j'\Big|^2 &= \sum_{i,j} |s_i\bar{t}_j - s_i'\bar{t}_j'|^2 \\
&= \sum_{i,j} |s_i(\bar{t}_j - \bar{t}_j') - (s_i' - s_i)\bar{t}_j'|^2 \le 2n\Big(\sum_j |t_j - t_j'|^2 + \sum_i |s_i - s_i'|^2\Big) \\
&= 2n\,\mathbb{E}\Big|\sum_i g_i(s_i - s_i') + \sum_j \eta_j(t_j - t_j')\Big|^2
\end{aligned}
\]
where $(g_i)_i$, $(\eta_j)_j$ are i.i.d. $\mathcal{N}(0,1)$. It follows from Slepian's lemma that
\[
\mathbb{E}\|W\|_{cut} \le \sqrt{2n}\,\mathbb{E}\sup_{s_i, t_j\in\mathbb{C} : |s_i| = |t_j| = 1} \Big|\sum_i g_i s_i + \sum_j \eta_j t_j\Big| \le 4\sqrt{2}\sqrt{2n}\,n.
\]

Together with Borell's inequality above for $u = \sqrt{n}$, we obtain that, with probability at least $1 - \exp(-n/2)$, $\|W\|_{cut} \le 9n^{3/2}$.

As a consequence, for $\Delta' = \exp(-n/2)$, we have $r^*_G(\Delta') \le 36K_G^{\mathbb{C}}\sigma n^{3/2}$. It follows from Corollary 1 that, with probability at least $1 - \exp(-n/2)$,
\[
\frac{1}{2}\|\hat{Z} - Z^*\|_2^2 \le \langle \mathbb{E}C, Z^* - \hat{Z}\rangle \le r^*_G(\exp(-n/2)) \le 36K_G^{\mathbb{C}}\sigma n^{3/2}, \qquad (61)
\]
and so
\[
\min_{z\in\mathbb{C} : |z| = 1} \|\hat{x} - zx^*\|_2 \le c_0\sqrt{36K_G^{\mathbb{C}}\sigma}\,n^{1/4}.
\]
We conclude that the global approach is better than the local approach using the inclusion in (59).

THIRD UPPER BOUND ON THE FIXED POINT $r^*_G$: END OF THE PROOF OF THEOREM 6


The final approach is based on a decomposition from Fei and Chen (2019b) that we already used for the signed clustering problem. Here, for the angular group synchronization problem, the projection operator is simpler since $Z^*$ is the rank-one matrix $x^*(\bar{x}^*)^\top$ and all the processes are Gaussian processes.
In order to work with independent random variables, we consider the following matrix $\Psi \in \mathbb{C}^{n\times n}$:
\[
\Psi_{ij} = \begin{cases} W_{ij} & \text{if } i \le j, \\ 0 & \text{otherwise,} \end{cases} \qquad (62)
\]
where the 0 entries are considered as independent Gaussian variables with 0 variance; therefore, $\Psi$ has independent Gaussian entries, and satisfies the relation $W = \Psi + \bar{\Psi}^\top$.
As we did for the signed clustering problem, we decompose the inner product $\langle W, Z - Z^*\rangle$ into two parts according to the SVD of $Z^*$. We know that $Z^*$ is the rank-one matrix $x^*(\bar{x}^*)^\top$ and $\|x^*\|_2 = \sqrt{n}$; then $v := x^*/\sqrt{n}$ is a unit singular vector of $Z^*$ and we define the following projection operator
\[
\mathcal{P} : M \in \mathbb{C}^{n\times n} \mapsto v\bar{v}^\top M + Mv\bar{v}^\top - v\bar{v}^\top Mv\bar{v}^\top
\]
and its associated orthogonal projection
\[
\mathcal{P}^\perp : M \in \mathbb{C}^{n\times n} \mapsto M - \mathcal{P}(M) = (I_n - v\bar{v}^\top)M(I_n - v\bar{v}^\top).
\]
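The algebraic properties of $\Psi$, $\mathcal{P}$ and $\mathcal{P}^\perp$ used in the sequel can be checked on a random draw. This is a minimal sketch assuming NumPy; the helper names (`complex_wigner`, `proj_P`, `proj_P_perp`) are hypothetical and the Wigner sampler follows the description of $W$ given at the start of this appendix.

```python
import numpy as np

def complex_wigner(n, rng):
    """Return (W, Psi): W Hermitian with zero diagonal, Psi its part with i <= j."""
    G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    Psi = np.triu(G, k=1)            # real and imaginary parts have variance 1/2
    return Psi + Psi.conj().T, Psi

def proj_P(M, v):
    """P(M) = v v^H M + M v v^H - v v^H M v v^H."""
    Pv = np.outer(v, v.conj())
    return Pv @ M + M @ Pv - Pv @ M @ Pv

def proj_P_perp(M, v):
    """P_perp(M) = (I - v v^H) M (I - v v^H)."""
    Q = np.eye(v.size) - np.outer(v, v.conj())
    return Q @ M @ Q

rng = np.random.default_rng(1)
n = 5
x_star = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
v = x_star / np.sqrt(n)
Z_star = np.outer(x_star, x_star.conj())
W, Psi = complex_wigner(n, rng)

print(np.allclose(W, Psi + Psi.conj().T))                  # W = Psi + conj(Psi)^T
print(np.allclose(proj_P(W, v) + proj_P_perp(W, v), W))    # P + P_perp = identity
print(np.allclose(proj_P_perp(Z_star, v), 0))              # P_perp(Z*) = 0
```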





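For any $Z \in \mathcal{C}_r := \mathcal{C} \cap \{Z^* + \sqrt{2r}B_2^{n\times n}\}$, we consider the following decomposition as in Fei and Chen (2019b):
\[
\langle W, Z - Z^*\rangle = \underbrace{\langle \mathcal{P}(Z - Z^*), W\rangle}_{S_1(Z)} + \underbrace{\langle \mathcal{P}^\perp(Z - Z^*), W\rangle}_{S_2(Z)}.
\]
Next, we upper bound with large probability each of the two last terms uniformly over all $Z \in \mathcal{C}_r$. We start with the $S_1(Z)$ term: for any $Z \in \mathcal{C}_r$, we have
\[
\begin{aligned}
S_1(Z) &= \langle W, \mathcal{P}(Z - Z^*)\rangle = \langle \mathcal{P}(W), Z - Z^*\rangle \\
&= \langle v\bar{v}^\top W, Z - Z^*\rangle + \langle Wv\bar{v}^\top, Z - Z^*\rangle - \langle v\bar{v}^\top Wv\bar{v}^\top, Z - Z^*\rangle \\
&= 2\langle v\bar{v}^\top W, Z - Z^*\rangle - \langle v\bar{v}^\top Wv\bar{v}^\top, Z - Z^*\rangle \\
&= 2\langle v\bar{v}^\top \Psi, Z - Z^*\rangle + 2\langle v\bar{v}^\top \bar{\Psi}^\top, Z - Z^*\rangle - \langle v\bar{v}^\top(\Psi + \bar{\Psi}^\top)v\bar{v}^\top, Z - Z^*\rangle \\
&= 2\langle v\bar{v}^\top \Psi, Z - Z^*\rangle + 2\langle v\bar{v}^\top \bar{\Psi}^\top, Z - Z^*\rangle - 2\langle v\bar{v}^\top \Psi v\bar{v}^\top, Z - Z^*\rangle \\
&= 2\langle v\bar{v}^\top \Psi, Z - Z^*\rangle + 2\langle v\bar{v}^\top \bar{\Psi}^\top, Z - Z^*\rangle - 2\langle v\bar{v}^\top \Psi, (Z - Z^*)v\bar{v}^\top\rangle.
\end{aligned}
\]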


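Then, bounding separately each of those three terms will lead us to a bound for $S_1(Z)$. Let us show how to bound the first term. Similar arguments can be used to control the other two terms.


We define $V := v\bar{v}^\top \Psi$, so that $V_{ij} = \sum_k (v\bar{v}^\top)_{ik}\Psi_{kj} = \sum_{k\le j} v_i\bar{v}_k W_{kj}$. We want to find a high-probability upper bound on $\sup_{Z\in\mathcal{C}_r} \langle V, Z - Z^*\rangle$. To that end, we simply use the inclusion $\mathcal{C}_r \subset Z^* + \sqrt{2r}B_2^{n\times n}$ to get
\[
\sup_{Z\in\mathcal{C}_r} \langle V, Z - Z^*\rangle \le \sup_{Z\in\sqrt{2r}B_2^{n\times n}} \langle V, Z\rangle = \sqrt{2r}\,\|V\|_2
\]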
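so that
\[
\mathbb{E}\Big[\sup_{Z\in\mathcal{C}_r} \langle V, Z - Z^*\rangle\Big] \le \sqrt{2r}\,\mathbb{E}\|V\|_2 \le \sqrt{2r}\Big(\sum_{i,j}\sum_{k\le j} |v_i|^2|v_k|^2\,\mathbb{E}|\Psi_{kj}|^2\Big)^{1/2} \le \sqrt{2r}\Big(\sum_{i,j}\sum_{k\le j} \frac{1}{n^2}\Big)^{1/2} \le \sqrt{2rn}.
\]
Moreover, we have
\[
\begin{aligned}
\mathbb{E}\big[|\langle V, Z - Z^*\rangle|^2\big] &= \mathbb{E}\big[|\langle \Psi, v\bar{v}^\top(Z - Z^*)\rangle|^2\big] = \sum_{i\le j} \big|(v\bar{v}^\top(Z - Z^*))_{ij}\big|^2 = \sum_{i\le j} \Big|\sum_k v_i\bar{v}_k(Z - Z^*)_{kj}\Big|^2 \\
&\le \sum_{i,j}\sum_k |v_i|^2|v_k|^2|(Z - Z^*)_{kj}|^2 \le \frac{\|Z - Z^*\|_2^2}{n} \le \frac{2r}{n}.
\end{aligned}
\]
Now, we apply Borell's inequality (see, for instance, Theorem 4.1 in Ledoux (2001) or pages 56-57 in Ledoux and Talagrand (1991)): for $u = n$, with probability at least $1 - \exp(-u^2/2)$,
\[
\sup_{Z\in\mathcal{C}_r} \langle V, Z - Z^*\rangle \le \mathbb{E}\Big[\sup_{Z\in\mathcal{C}_r} \langle V, Z - Z^*\rangle\Big] + u\sup_{Z\in\mathcal{C}_r} \sqrt{\mathbb{E}\big[|\langle V, Z - Z^*\rangle|^2\big]} \le \sqrt{2rn} + n\sqrt{\frac{2r}{n}} = 2\sqrt{2rn}.
\]
Similar computations yield the same upper bounds for the two other terms. Therefore, we obtain that, with probability at least $1 - 3\exp(-n^2/2)$, it holds true that
\[
\sup_{Z\in\mathcal{C}_r} S_1(Z) \le 6\sqrt{2rn}. \qquad (63)
\]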
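Now, it remains to control the second term $S_2(Z)$. For any $Z \in \mathcal{C}_r$, we have
\[
S_2(Z) = \langle W, \mathcal{P}^\perp(Z - Z^*)\rangle = \langle W, \mathcal{P}^\perp(Z)\rangle \le \|\mathcal{P}^\perp(Z)\|_*\|W\|_{op}.
\]
Since $Z \succeq 0$, we have $(I_n - v\bar{v}^\top)Z(I_n - v\bar{v}^\top) \succeq 0$, that is $\mathcal{P}^\perp(Z) \succeq 0$, and so
\[
\begin{aligned}
\|\mathcal{P}^\perp(Z)\|_* &= \mathrm{Tr}(\mathcal{P}^\perp(Z)) \overset{(i)}{=} \mathrm{Tr}(\mathcal{P}^\perp(Z - Z^*)) = \mathrm{Tr}\big((I_n - v\bar{v}^\top)(Z - Z^*)(I_n - v\bar{v}^\top)\big) \\
&\overset{(ii)}{=} \mathrm{Tr}\big((I_n - v\bar{v}^\top)^2(Z - Z^*)\big) = \mathrm{Tr}\big((I_n - v\bar{v}^\top)(Z - Z^*)\big) \\
&\overset{(iii)}{=} \mathrm{Tr}\big(v\bar{v}^\top(Z^* - Z)\big) \le \|v\bar{v}^\top\|_2\|Z - Z^*\|_2 \le \sqrt{2r},
\end{aligned}
\]
where (i) is due to the fact that $\mathcal{P}^\perp(Z^*) = 0$ by construction, (ii) holds because $I_n - v\bar{v}^\top$ is Hermitian and $(I_n - v\bar{v}^\top)^2 = I_n - v\bar{v}^\top$, and (iii) holds since $\mathrm{Tr}(Z) = \mathrm{Tr}(Z^*) = n$ for any $Z \in \mathcal{C}_r$.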
Moreover, it follows from Davidson and Szarek (2001) that, for all $u > 0$, with probability at least $1 - 2\exp(-u^2/2)$, $\|W\|_{op} \le 2\sqrt{n} + u$. Taking $u = 2\sqrt{n}$, we conclude that, with probability at least $1 - 2\exp(-2n)$, we have
\[
\sup_{Z\in\mathcal{C}_r} S_2(Z) \le 4\sqrt{2nr}. \qquad (64)
\]
It follows from (63) and (64) that, with probability at least $1 - 5\exp(-n/2)$,
\[
\sup_{Z\in\mathcal{C}_r} \langle W, Z - Z^*\rangle \le 10\sqrt{2rn}.
\]
Then, for $\Delta = 5\exp(-n/2)$ and $r = 2(20\sigma)^2 n$ we have, with probability at least $1 - \Delta$, $\sup_{Z\in\mathcal{C}_r} \langle \sigma W, Z - Z^*\rangle \le r/2$. Hence, we conclude that $r^*_G(\Delta) \le 2(20\sigma)^2 n$.
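The first and third fixed-point bounds both come from solving an inequality of the form $c\sqrt{r} \le r/2$, whose smallest solution is $r = 4c^2$. The short sketch below, in plain Python, redoes this arithmetic; the function name `smallest_radius` and the values of `n` and `sigma` are illustrative assumptions.

```python
import math

def smallest_radius(c):
    """Smallest r > 0 such that c * sqrt(r) <= r / 2, namely r = 4 c^2."""
    return 4.0 * c * c

n, sigma = 50, 0.3

# First approach: sup <= 2 n sigma sqrt(2 r), i.e. c = 2 sqrt(2) n sigma.
r_first = smallest_radius(2.0 * math.sqrt(2.0) * n * sigma)

# Third approach: sup <= 10 sigma sqrt(2 r n), i.e. c = 10 sigma sqrt(2 n).
r_third = smallest_radius(10.0 * sigma * math.sqrt(2.0 * n))

print(r_first, 32 * sigma**2 * n**2)        # both equal 32 sigma^2 n^2
print(r_third, 2 * (20 * sigma) ** 2 * n)   # both equal 2 (20 sigma)^2 n
```

The gain of the third approach is thus a factor of order $n$ in the radius, coming from the extra $\sqrt{n}$ saved in the supremum.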
Now, we apply Corollary 1 to get that, with probability at least $1 - 5\exp(-n/2)$,
\[
\frac{1}{2}\|\hat{Z} - Z^*\|_2^2 \le 2(20\sigma)^2 n.
\]
Next, we know that the oracle $Z^*$ is the rank-one matrix $x^*(\bar{x}^*)^\top$, which has $n$ as its largest eigenvalue and associated eigenspace $\{\lambda x^* : \lambda \in \mathbb{C}\}$. In particular, $Z^*$ has a spectral gap $g = n$. Let $\hat{x} \in \mathbb{C}^n$ be a top eigenvector of $\hat{Z}$ with norm $\|\hat{x}\|_2 = \sqrt{n}$.

It follows from the Davis-Kahan Theorem (see, for example, Theorem 4.5.5 in Vershynin (2018) or Theorem 4 in Vu (2010)) that there exists a universal constant $c_0 > 0$ such that
\[
\min_{z\in\mathbb{C} : |z| = 1} \Big\|\frac{\hat{x}}{\sqrt{n}} - z\frac{x^*}{\sqrt{n}}\Big\|_2 \le \frac{c_0}{g}\|\hat{Z} - Z^*\|_2
\]
where $g = n$ is the spectral gap of $Z^*$. Using the previous inequality, we conclude that, with probability at least $1 - 5\exp(-n/2)$,
\[
\min_{z\in\mathbb{C} : |z| = 1} \|\hat{x} - zx^*\|_2 \le 40c_0\sigma.
\]
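The error measure $\min_{|z|=1} \|\hat{x} - zx^*\|_2$ has a closed form: the optimal phase is $z = \langle x^*, \hat{x}\rangle / |\langle x^*, \hat{x}\rangle|$. The following minimal sketch, assuming NumPy, computes it together with a top eigenvector normalized to have norm $\sqrt{n}$; the function names are hypothetical and the example uses the noiseless matrix $Z^*$ rather than an actual SDP solution.

```python
import numpy as np

def top_eigenvector(Z):
    """Top eigenvector of a Hermitian matrix, rescaled to have Euclidean norm sqrt(n)."""
    _, vecs = np.linalg.eigh(Z)          # eigenvalues are returned in ascending order
    return vecs[:, -1] * np.sqrt(Z.shape[0])

def alignment_error(x_hat, x_star):
    """Return (min over |z| = 1 of ||x_hat - z x_star||_2, the minimizing z)."""
    c = np.vdot(x_star, x_hat)           # sum_i conj(x_star_i) * x_hat_i
    z = c / abs(c) if abs(c) > 0 else 1.0 + 0.0j
    return float(np.linalg.norm(x_hat - z * x_star)), z

x_star = np.exp(1j * np.linspace(0.0, 3.0, 5))
Z_star = np.outer(x_star, x_star.conj())

err_rotated, z_opt = alignment_error(np.exp(0.7j) * x_star, x_star)
err_spectral, _ = alignment_error(top_eigenvector(Z_star), x_star)
print(err_rotated, z_opt, err_spectral)
```

Both errors vanish up to rounding, which reflects the fact that $x^*$ is only identifiable up to a global phase.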


Appendix D. Solving SDPs in practice 


The practical implementation of our approach to the problems of synchronization, signed 
clustering and MAX-CUT resorts to solving a convex optimization problem. In the present 
section, we describe the various algorithms we used for solving these SDPs. 


Pierra’s method 


For SDPs with constraints on the entries, we propose a simple modification of the method initially proposed by Pierra (1984). Let $f : \mathbb{R}^{n\times n} \to \mathbb{R}$ be a convex function. Let $\mathcal{C}$ denote a convex set which can be written as the intersection of convex sets $\mathcal{C} = S_1 \cap \cdots \cap S_J$. Let us define $H = \mathbb{R}^{n\times n} \times \cdots \times \mathbb{R}^{n\times n}$ ($J$ times) and let $D$ denote the (diagonal) subspace of $H$ of vectors of the form $(Z, \ldots, Z)$. In this new formalism, the problem can now be formulated as a minimization problem over the intersection of two sets only, i.e.
\[
\min\Big(\frac{1}{J}\sum_{j=1}^J f(Z_j) : Z = (Z_j)_{j=1}^J \in (S_1 \times \cdots \times S_J) \cap D\Big).
\]
Define $F(Z) = \frac{1}{J}\sum_{j=1}^J f(Z_j)$. The algorithm proposed by Pierra (1984) consists of performing the following iterations
\[
Z^{p+1} = \mathrm{Prox}_{I_{S_1\times\cdots\times S_J} + \frac{\varepsilon}{2}F}(B^p) \quad \text{and} \quad B^{p+1} = \mathrm{Proj}_D(Z^{p+1}),
\]
where $I_S$ denotes the indicator function of a set $S$. Pierra's method can be shown to converge in the setting of our finite dimensional experiments using (Martinet, 1972, Chapter V).


APPLICATION TO COMMUNITY DETECTION 


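Let us now present the computational details of Pierra's method for the community detection problem. We will estimate its membership matrix $\bar{Z}$ via the following SDP estimator
\[
\hat{Z} \in \operatorname*{argmax}_{Z\in\mathcal{C}} \langle A, Z\rangle,
\]
where $\mathcal{C} = \{Z \in \mathbb{R}^{n\times n} : Z \succeq 0, Z \ge 0, \mathrm{diag}(Z) \preceq I_n, \sum_{i,j} Z_{ij} \le \lambda\}$ and $\lambda = \sum_{i,j=1}^n \bar{Z}_{ij} = \sum_{k=1}^K |\mathcal{C}_k|^2$ denotes the number of nonzero elements in the membership matrix $\bar{Z}$. The motivation for this approach stems from the fact that the membership matrix $\bar{Z}$ is actually the oracle, i.e., $Z^* = \bar{Z}$, where $Z^* \in \operatorname{argmax}_{Z\in\mathcal{C}} \langle \mathbb{E}[A], Z\rangle$. The function $f$ to minimize in the Pierra algorithm is defined as $f(Z) = -\langle A, Z\rangle$.
Let us denote by $\mathbb{S}_+$ the set of symmetric positive semi-definite matrices in $\mathbb{R}^{n\times n}$. The set $\mathcal{C}$ is the intersection of the sets
\[
S_1 = \mathbb{S}_+; \quad S_2 = \{Z \in \mathbb{R}^{n\times n} \mid Z \ge 0\}; \quad S_3 = \{Z \in \mathbb{R}^{n\times n} \mid \mathrm{diag}(Z) \preceq I\}; \quad \text{and} \quad S_4 = \Big\{Z \in \mathbb{R}^{n\times n} \;\Big|\; \sum_{i,j=1}^n Z_{ij} \le \lambda\Big\}.
\]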

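We now compute, for all $B = (B_j)_{j=1}^4 \in (\mathbb{R}^{n\times n})^4$ and $j = 1, \ldots, 4$ ($J = 4$ here),
\[
\mathrm{Prox}_{I_{S_1\times\cdots\times S_J} + \frac{\varepsilon}{2}F}(B)_j = \mathrm{Prox}_{I_{S_j} + \frac{\varepsilon}{2J}f}(B_j).
\]
We have, for $J = 4$,
\[
\mathrm{Prox}_{I_{S_j} + \frac{\varepsilon}{2J}f}(B_j) = \operatorname*{argmin}_{Z\in S_j}\Big(-\frac{\varepsilon}{2J}\langle A, Z\rangle + \frac{1}{2}\|Z - B_j\|_2^2\Big) = P_{S_j}\Big(B_j + \frac{\varepsilon}{2J}A\Big).
\]
On the other hand, the projection operators $P_{S_j}$, $j = 1, 2, 3, 4$, are given by
\[
\begin{aligned}
P_{S_1}(Z_1) &= U\max\{\Sigma, 0\}U^\top, \text{ where } Z_1 \text{ has eigenvalue decomposition } Z_1 = U\Sigma U^\top, \\
P_{S_2}(Z_2) &= \max\{Z_2, 0\}, \qquad P_{S_3}(Z_3) = Z_3 - \mathrm{diag}(Z_3) + \min\{1, \mathrm{diag}(Z_3)\}, \\
P_{S_4}(Z_4) &= \frac{\lambda}{\sum_{i,j=1}^n (Z_4)_{ij}}\,Z_4.
\end{aligned}
\]
To sum up, Pierra's method can be formulated as follows.

For all iterations $k$ in $\mathbb{N}$, compute the eigenvalue decomposition $B_1^k + \frac{\varepsilon}{2\cdot 4}A = U^k\Sigma^k(U^k)^\top$. Then compute, for all $j = 1, \ldots, 4$,
\[
\begin{aligned}
B_j^{k+1} = \frac{1}{4}\Big[\, & U^k\max\{\Sigma^k, 0\}(U^k)^\top + \max\Big\{B_2^k + \frac{\varepsilon}{2\cdot 4}A, 0\Big\} + B_3^k + \frac{\varepsilon}{2\cdot 4}A - \mathrm{diag}\Big(B_3^k + \frac{\varepsilon}{2\cdot 4}A\Big) \\
& + \min\Big\{1, \mathrm{diag}\Big(B_3^k + \frac{\varepsilon}{2\cdot 4}A\Big)\Big\} + \frac{\lambda}{\sum_{i,j=1}^n \big(B_4^k + \frac{\varepsilon}{2\cdot 4}A\big)_{ij}}\Big(B_4^k + \frac{\varepsilon}{2\cdot 4}A\Big)\Big].
\end{aligned}
\]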
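APPLICATION TO SIGNED CLUSTERING


Let us now turn to the signed clustering problem. We will estimate its membership matrix $\bar{Z}$ via the following SDP estimator $\hat{Z} \in \operatorname{argmax}_{Z\in\mathcal{C}} \langle A, Z\rangle$, where $\mathcal{C} = \{Z \in \mathbb{R}^{n\times n} : Z \succeq 0, Z_{ij} \in [0,1], Z_{ii} = 1, i = 1, \ldots, n\}$. As in the community detection case, the function $f$ to minimize in the Pierra algorithm is defined as $f(Z) = -\langle A, Z\rangle$.

Let us denote by $\mathbb{S}_+$ the set of symmetric positive semi-definite matrices in $\mathbb{R}^{n\times n}$. The set $\mathcal{C}$ is the intersection of the sets $S_1 = \mathbb{S}_+$, $S_2 = \{Z \in \mathbb{R}^{n\times n} \mid Z \in [0,1]^{n\times n}\}$ and $S_3 = \{Z \in \mathbb{R}^{n\times n} \mid Z_{ii} = 1, i = 1, \ldots, n\}$.

As before, for $j = 1, \ldots, 3$,
\[
\mathrm{Prox}_{I_{S_j} + \frac{\varepsilon}{2\cdot 3}f}(B_j) = P_{S_j}\Big(B_j + \frac{\varepsilon}{2\cdot 3}A\Big)
\]
and the projection operators $P_{S_j}$, $j = 1, 2, 3$, are given by
\[
P_{S_1}(Z_1) = U\max\{\Sigma, 0\}U^\top, \quad P_{S_2}(Z_2) = \min\{\max\{Z_2, 0\}, 1\} \quad \text{and} \quad P_{S_3}(Z_3) = Z_3 - \mathrm{diag}(Z_3) + I.
\]
To sum up, Pierra's method can be formulated as follows.

At each iteration $k$, compute the eigenvalue decomposition $B_1^k + \frac{\varepsilon}{2\cdot 3}A = U^k\Sigma^k(U^k)^\top$. Then compute, for all $j = 1, \ldots, 3$,
\[
B_j^{k+1} = \frac{1}{3}\Big[U^k\max\{\Sigma^k, 0\}(U^k)^\top + \min\Big\{\max\Big\{B_2^k + \frac{\varepsilon}{2\cdot 3}A, 0\Big\}, 1\Big\} + B_3^k + \frac{\varepsilon}{2\cdot 3}A - \mathrm{diag}\Big(B_3^k + \frac{\varepsilon}{2\cdot 3}A\Big) + I\Big].
\]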
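The signed clustering iteration above can be sketched in a few lines. This is a minimal illustration assuming NumPy, a fixed $\varepsilon$ and a fixed number of iterations (both hypothetical choices); it is not the implementation used for the experiments. Since all $B_j^{k+1}$ coincide after averaging, a single matrix `B` is stored. With a fixed $\varepsilon$ the averaged iterate is only approximately feasible, so a final projection or rounding step may be needed in practice.

```python
import numpy as np

def proj_psd(M):
    """Projection onto symmetric PSD matrices: clip negative eigenvalues to zero."""
    M = (M + M.T) / 2.0
    eigvals, U = np.linalg.eigh(M)
    return (U * np.maximum(eigvals, 0.0)) @ U.T

def proj_box(M):
    """Projection onto matrices with entries in [0, 1]."""
    return np.clip(M, 0.0, 1.0)

def proj_unit_diag(M):
    """Projection onto matrices with unit diagonal: Z - diag(Z) + I."""
    P = M.copy()
    np.fill_diagonal(P, 1.0)
    return P

def pierra_signed_clustering(A, eps=1.0, n_iter=200):
    """Averaged-projection iteration B <- (1/3) sum_j P_{S_j}(B + eps/(2*3) A)."""
    J = 3
    B = np.eye(A.shape[0])
    shift = eps / (2.0 * J) * A
    for _ in range(n_iter):
        M = B + shift
        B = (proj_psd(M) + proj_box(M) + proj_unit_diag(M)) / J
    return B

# Toy signed graph with two clusters of size 3: +1 inside clusters, -1 across.
s = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
A = np.outer(s, s)
np.fill_diagonal(A, 0.0)
B = pierra_signed_clustering(A)
print(np.round(B, 2))
```

On this noiseless example the within-cluster entries of `B` are large and the across-cluster entries are non-positive, so thresholding at 1/2 recovers the two clusters.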


The Burer-Monteiro approach and the Manopt Solver 


To solve the MAX-CUT and Angular Synchronization problems, we rely on MANOPT, a freely available MATLAB toolbox for optimization on manifolds Boumal et al. (2014). MANOPT runs the Riemannian Trust-Region method on the corresponding Burer-Monteiro non-convex problem with rank bounded by $p$ as follows. The Burer-Monteiro approach consists of replacing the optimization of a linear function $\langle A, Z\rangle$ over the convex set $\mathcal{Z} = \{Z \succeq 0 : \mathcal{A}(Z) = b\}$ with the optimization of the quadratic function $\langle AY, Y\rangle$ over the non-convex set $\mathcal{Y} = \{Y \in \mathbb{R}^{n\times p} : \mathcal{A}(YY^\top) = b\}$.

In the context of the MAX-CUT problem, the Burer-Monteiro approach amounts to the following steps. Denoting by $Z$ the positive semidefinite matrix $Z = zz^\top$, note that both the cost function and the constraints lend themselves to be expressed linearly in terms of $Z$. Dropping the NP-hard rank-1 constraint on $Z$, we arrive at the well-known convex relaxation of MAX-CUT from Goemans and Williamson (1995)
\[
\hat{Z} \in \operatorname*{argmin}_{Z\in\mathcal{C}} \langle A, Z\rangle,
\]
where $\mathcal{C} := \{Z \in \mathbb{R}^{n\times n} : Z \succeq 0, Z_{ii} = 1, \forall i = 1, \ldots, n\}$.
If a solution $\hat{Z}$ of this SDP has rank 1, then $\hat{Z} = z^*z^{*\top}$ for some $z^*$, which then gives the optimal cut. Recall that in the general case of higher rank $\hat{Z}$,
\[
\hat{Y} \in \operatorname*{argmin}_{Y\in\mathcal{B}} \langle AY, Y\rangle, \qquad (65)
\]
where $\mathcal{B} := \{Y \in \mathbb{R}^{n\times p} : \mathrm{diag}(YY^\top) = 1\}$. Note that the constraint $\mathrm{diag}(YY^\top) = 1$ requires each row of $Y$ to have unit $\ell_2$ norm, rendering $Y$ a point on the Cartesian product of $n$ unit spheres $\mathbb{S}^{p-1}$ in $\mathbb{R}^p$, which is a smooth manifold. Also note that the search space of the SDP is compact, since all $Z$ feasible for the SDP have identical trace equal to $n$.
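To make the geometry of (65) concrete, here is a minimal sketch assuming NumPy. It replaces the Riemannian Trust-Region method run by MANOPT with plain Riemannian gradient descent on the product of spheres (tangent-space projection of the gradient followed by row normalization as a retraction). The function names, step size and iteration count are illustrative assumptions and this is not the solver used in the experiments.

```python
import numpy as np

def maxcut_cost(A, Y):
    """Burer-Monteiro objective <A Y, Y> = tr(Y^T A Y)."""
    return float(np.sum((A @ Y) * Y))

def normalize_rows(Y):
    """Retraction onto the product of unit spheres: rescale each row to unit norm."""
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def burer_monteiro_maxcut(A, p, n_iter=500, seed=0):
    """Riemannian gradient descent for min <A Y, Y> subject to diag(Y Y^T) = 1."""
    rng = np.random.default_rng(seed)
    Y = normalize_rows(rng.standard_normal((A.shape[0], p)))
    step = 1.0 / (4.0 * np.linalg.norm(A, 2) + 1e-12)   # conservative fixed step
    for _ in range(n_iter):
        G = 2.0 * A @ Y                                  # Euclidean gradient (A symmetric)
        # Remove, row by row, the component of G that is normal to the sphere.
        G_tan = G - np.sum(G * Y, axis=1, keepdims=True) * Y
        Y = normalize_rows(Y - step * G_tan)
    return Y

# Adjacency matrix of the 4-cycle, a bipartite graph whose optimal cut uses all edges.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[i, j] = A[j, i] = 1.0

Y0 = burer_monteiro_maxcut(A, p=3, n_iter=0)   # random initial point
Y = burer_monteiro_maxcut(A, p=3)
print(maxcut_cost(A, Y0), maxcut_cost(A, Y))
```

The choice $p = 3$ satisfies $p(p+1)/2 \ge m$ for the $m = 4$ diagonal constraints of this example, which is the regime discussed next.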

If the convex set $\mathcal{Z}$ is compact, and $m$ denotes the number of constraints, it holds true that whenever $p$ satisfies $p(p+1)/2 \ge m$, the two problems share the same global optimum Barvinok (1995); Burer and Monteiro (2005). Building on the pioneering work of Burer and Monteiro (2005), Boumal et al. (2016) showed that if the set $\mathcal{Z}$ is compact and the set $\mathcal{Y}$ is a smooth manifold, then $p(p+1)/2 \ge m$ implies that, for almost all cost matrices $A$, global optimality is achieved by any $Y$ satisfying a second-order necessary optimality condition. Following Boumal et al. (2016), for $p = \lceil\sqrt{2n}\rceil$ and for almost all matrices $A$, even though (65) is non-convex, any local optimum $Y$ is a global optimum (and so is $Z = YY^\top$), and all saddle points have an escape direction (the Hessian has a negative eigenvalue). Note that for $p > n/2$ the same statement holds true for all $A$, and was previously established by Boumal (2015).